1
What This Book Is About 


Program evaluation derives from the commonsense idea that 
social programs should have demonstrable benefits. Literacy 
programs for adults, for example, should lead to measurable 
improvements in reading skill. Lowering speed limits on inter- 
state highways should plainly reduce the number of automobile 
fatalities and save gasoline. Increasing the length of prison sen- 
tences for white-collar crime should clearly reduce the amount 
of insider trading. Increasing the price of electricity during the 
middle of the day should visibly reduce consumption during 
"peak load" hours. Efforts to educate sexually active individuals 
about "safe sex" should plainly slow the spread of AIDS. Implicit 
is the notion that social programs ought to have explicit aims by 
which success or failure may be empirically judged. Mere asser- 
tions about success or failure are insufficient. The assertions 
must be supported by evidence. 

It should not be surprising, therefore, that program evalua- 
tion, broadly construed, has a long history. In ancient Rome, for 
instance, tax policies were altered in response to observed 
fluctuations in revenues. During the last decades of the eigh- 
teenth century, the British Admiralty began requiring that its 
crews drink citrus juice on long voyages after evidence was
produced showing that citrus juice prevented scurvy. In the
twentieth century, the indeterminate prison sentence was
introduced in the United States, partly in response to
rates of recidivism under earlier sentencing policies.
[...] judgments have always been made about whether prospective
or ongoing programs are effective.

In recent years, however, [...] has evolved into [...] how well
programs are [...] impact, and the analy[...]

[...] because the particular [...] with the most effect [...]
designing a program, it is always important to know what the
target population is. [...] (e.g., unemployed teenagers) typically
is best determined by survey procedures, [...]

Finally, evaluation research ca[...] and empirical ge[...]
[...] and that evaluation research has something to offer. That is,
without interest and funding from organizations or agencies 
who have a stake in whether a particular program is working, 
evaluation research would soon wither away. At its best, how- 
ever, evaluation research can only help policymakers make judg- 
ments about the relative success or failure of programs and 
policies, whether these be prospective or in operation. Evalua- 
tion research is not a substitute for policymaker judgments, and 
responsible evaluators have no interest in either circumventing 
the political process or becoming central players. 

Put another way, evaluation research is essentially about the 
provision of the most accurate information practically possible 
in an evenhanded manner. For example, an evaluation study 
might determine the likely impact of a program providing infor- 
mation about sexually transmitted diseases to adolescent 
schoolchildren but leave unaddressed the political question of 
whether the schools should make such programs mandatory. 
Similarly, an evaluation might estimate the degree to which 
charges for the treatment of waste water would deter manufac- 
turers from polluting but be silent on the fairness of such pric- 
ing policies. Or an evaluation might determine that bottle-ban 
initiatives really reduce litter but take no position on whether 
such bans are an unreasonable interference with a free market. 

So, what then is a successful evaluation? To anticipate a bit, 
an evaluation attains practical perfection when it provides the 
best information possible on the key policy questions within the 
given set of real-world constraints. This implies that all
evaluations are flawed if measured against the yardstick of abstract
perfection or if judged without taking time, budget, ethical, and 
political restrictions into account. In other words, there is really 
no such thing as a truly perfect evaluation, and idealized text- 
book treatments of research design and analysis typically estab- 
lish useful aspirations but unrealistic expectations. 

A “merely” successful evaluation, in contrast, falls short of 
providing the best information possible under the given circum- 
stances but provides better information than would otherwise 
have been available. That is, the proper measure of "success" is
current knowledge, not what ultimately might be known. Thus, if
very little is known about the effectiveness of a particular
program, a relatively weak evaluation on a purely methodological
scale may nevertheless be an enormous success in [...]. For
example, if virtually nothing is known about whether perpetrators
of family violence can be deterred by an arrest, a [...] (flawed)
evaluation may be extremely successful (Sherman [...] 1989).
[...] nothing is being said about how the evaluation is
ultimately used. Indeed, an evaluation may be successful even if
the information provided is ignored, or even misused. Once the
findings are presented in a clear and accessible fashion, the
evaluation is over. What follows is certainly critical, but is
essentially a political process. Interested evaluators are best
off observing the action at some distance, preferably through
heavy lenses.


Goals and Organization of 
This Book 


This book provides an introduction to the variety of purposes
for which evaluation research may be used and to the range of
methods that are currently employed. Specific examples are
given to provide concrete illustrations of both the goals of
evaluation researchers and the methods used. Although the book is
intended to be comprehensive in the sense of describing major
uses of evaluation research, it cannot pretend to be
encyclopedic. Citations to more detailed discussions are provided.
In addition, there are several general references that survey the
field of evaluation in a more detailed fashion (Suchman 1967;
Weiss 1972; Cronbach and Associates 1980; Rossi and Freeman
1989; Cronbach 1982, [...]

[...] the evaluation research enterprise is presented in an idealized,
chronological fashion to emphasize that the research methods 
employed depend on the empirical question being asked and the 
evolution of the social program under scrutiny. For example, 
research procedures that might well make sense as a program is 
initially being designed may be ineffective when the impact of 
an ongoing program is being addressed. Likewise, research pro- 
cedures that are effective in determining how a program works 
will often differ from research procedures that are effective in 
determining whether a program works. In short, our message is 
pragmatic; research tools should be chosen for the particular job 
at hand.


Key Concepts in
Evaluation Research


[...] of central concepts in evaluation [...] intellectual roots
of evaluation [...] sciences, social science concepts
[...]minate. All social science fields [...] development of
evaluation research [...]; therefore, that the best evaluation
[...] evaluators draw on a number of dis[...]

[...] evaluations are almost exclusively concerned with making judgments
about policies and programs that are on the current agenda of 
policymakers (broadly construed to include a wide variety of 
"players," not just public officials). Clearly, the policy space is 
time and space bound and does not encompass a permanently 
fixed set of policies and programs; it changes over time and it var- 
ies over political jurisdictions. For example, in the 1960s, the 
national policy space in the United States included direct 
income support, in the form of a “negative income tax,” for 
households falling below the poverty line. In response, a number 
of evaluation projects explored what the impact of such support 
might be. In the 1980s, the national policy space no longer 
includes a negative income tax. Likewise, in the middle 1970s 
communities across the State of California were considering a 
wide variety of water conservation programs because of a serious 
drought. By the early 1980s, other problems dominated the local 
policy space, in part because the drought had passed. 

It is the almost exclusive attention to matters in current pol- 
icy space that distinguishes evaluation research from academic 
social science, and a good evaluation researcher knows how to 
determine what is in the policy space and what is not. For exam- 
ple, an academic social scientist might study the "urban under- 
class" as an intellectual matter and may in addition be genuinely 
concerned about their plight. In contrast, the evaluation re- 
searcher would focus on the current policy debates and especially 
social interventions that are being contemplated or are already in 
place. Still more concretely, the academic might have a long-stand- 
ing interest in theories of segmented labor markets and undertake 
a study of the causes of teenage unemployment to test competing
theories. The evaluator could certainly draw on insights from
such research but might concentrate, for instance, on the impact 
of a particular job training program for unemployed teenagers. 


Stakeholders 


By virtue of its engagement in policy space matters, evaluation
research is saturated with political concerns. The outcome
of an evaluation can be expected to attract the attention of
persons, groups, and agencies who hold stakes in the outcome.
These “stakeholders” include policymakers on the executive and 
legislative levels; the agencies and their officials who ad- 
minister the policies or programs under scrutiny; the persons 
who deliver the services in question; often, groups representing 
the targets or beneficiaries of the programs, or the targets or 
beneficiaries themselves; and sometimes taxpayers and citizens 
generally. In almost all program issues, stakeholders may be
aligned on opposing sides, some favoring the program and some
opposing. And whatever the outcome of the evaluation may be,
there are usually some who are pleased and some who are
disappointed: It is usually impossible to please everyone. For
example, an evaluation showing the benefits of allowing convicts
to be employed by private firms (e.g., to manufacture furniture)
might be strongly endorsed by prisoners’ rights groups, prison
officials, and local chambers of commerce but be roundly
criticized by law enforcement groups, law-and-order legislators,
and labor unions.


[...] an evaluation report ordinarily is not regarded as a [...]
document: Rather, it is scrutinized minutely by stakeholders
wh[...] report’s implications. [...] research should not be [...]
avoid controversy, or who [...] that shortly). Alleged
methodological errors are easy targets.* A
third implication is that the conduct of evaluation research 
often involves careful prior negotiations with stakeholders. An 
evaluation of a within-school educational program will be seri- 
ously impeded, for example, if a teachers’ organization recom- 
mends that its members not cooperate with the evaluator. 


Program Effectiveness: 
Three Meanings 


While the importance of the political environment in which 
evaluation research is undertaken is hard to overemphasize, 
political matters are hardly the whole story. A mixed bag of 
legitimate technical skills are the evaluator’s ticket of admis- 
sion and in the end justify his or her keep. We turn, then, to tech- 
nical matters, beginning with conceptions of the proverbial bot- 
tom line: program effectiveness. 

In the broadest sense, evaluations are concerned with 
whether or not programs or policies are achieving their goals 
and purposes. Discerning the goals of policies and programs is 
an essential part of an evaluation and almost always its starting 
point. However, goals and purposes are often stated vaguely, 
typically in an attempt to garner as much political support as 
possible. Programs and policies that do not have clear and con- 
sistent goals cannot be evaluated for their effectiveness. In 
response, a subspecialty of evaluation research, evaluability 
assessment, has developed to uncover the goals and purposes of 
policies and programs in order to judge whether or not they can 
be evaluated. 

Insofar as goals are articulated, "effectiveness" is the extent to 
which a policy or program is achieving its goals and purposes. In 
practice, it cannot be overemphasized that the concept of effec- 
tiveness must always address the issue: “compared with what?” 
For marginal effectiveness the issue is dosage; the consequences 
of more or less of some intervention are assessed. For example, 
one might study whether decreasing by one-half the ratio of 
grade school students to their teachers later doubles student
performance on standardized reading tests. For relative
effectiveness, the contrast is between a program and the absence of
the program or between two or more program options. For
example, one might compare the impact on the number of
cancer screenings generated by public service announcements
versus the number generated by mass mailings of [...],
both containing the same educational information. Finally, it
is common to consider effectiveness in dollar terms:
cost-effectiveness. Comparisons are made in units of outcome per
dollar. For example, vaccinating the elderly for influenza would
probably be less effective in reducing the number of influenza
fatalities for all age groups than vaccinating everyone regardless
of age. However, focusing on the elderly may be more
cost-effective because, with mass vaccinations, a large number of
people would be vaccinated who were not significantly at risk.
That is, the cost per life saved would be lower.
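
The vaccination comparison can be made concrete with a little
arithmetic. The sketch below is only an illustration: every figure
in it (numbers vaccinated, cost per dose, lives saved) is a
hypothetical assumption chosen to show the pattern, not data from
any study.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """A vaccination strategy described by hypothetical, made-up figures."""
    name: str
    people_vaccinated: int
    cost_per_dose: float
    lives_saved: float

    @property
    def total_cost(self) -> float:
        return self.people_vaccinated * self.cost_per_dose

    @property
    def cost_per_life_saved(self) -> float:
        # Cost-effectiveness expressed as dollars per unit of outcome.
        return self.total_cost / self.lives_saved


# Assumed numbers, for illustration only.
elderly_only = Strategy("elderly only", 1_000_000, 10.0, 500)
everyone = Strategy("everyone", 8_000_000, 10.0, 800)

for s in (elderly_only, everyone):
    print(f"{s.name}: lives saved = {s.lives_saved:.0f}, "
          f"cost per life saved = ${s.cost_per_life_saved:,.0f}")
```

Under these assumed figures, vaccinating everyone saves more lives
in total, yet the elderly-only strategy has the lower cost per life
saved, which is the distinction between effectiveness and
cost-effectiveness drawn in the text.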


Validity


It is one thing to properly conceptualize program effectiveness
and quite another to determine empirically whether a program is
effective. And determining effectiveness depends, in turn, on
the validity of the evaluation. In other words, evaluation
research shares with other research activities the overriding
goal of achieving high validity. Little is learned from
evaluations with low validity.

Broadly stated [...] by which th[...]
to emphasize four kinds of validity: construct validity, internal
validity, external validity, and statistical conclusion validity 
(Cook and Campbell, 1979). We will return to the four kinds of 
validity after laying a bit more groundwork. 

Ideally, policymakers are seeking a binary assessment about 
a social program: thumbs up or thumbs down. Either the pro- 
gram works or it does not. In addition, they are ideally seeking 
a specific number indicating how effective the program is. Thus 
a prison vocational training program might reduce recidivism 
by 15%. Or a nutrition program for pregnant women in low- 
income neighborhoods may increase the birth weight of infants 
by an average 2.3 pounds. Or a company's affirmative action pro- 
gram may increase by 15 the number of Blacks and Latinos 
hired. As just noted, however, the world of program evaluation is 
never that simple. All assessments come with healthy amounts
of uncertainty, and evaluation results necessarily have varying 
amounts of credibility. To be sure, studies with greater validity 
provide more credible results, but some uncertainty will always 
remain. That is, evaluation findings are not right or wrong, but 
more or less credible. Typically, the uncertainty is expressed in 
how the role of chance is represented but that is hardly the 
whole story (see below).

It is perhaps important to stress, as well, that the uncertainty 
in evaluation results is inherent in the social phenomena being 
studied and no research methodology, even the ideal, can 
remove it. However, stronger research methods typically reduce 
the amount of uncertainty. 


Measurement and Construct Validity 


Measurement is nothing more than a systematic procedure to 
assign (real) numbers to objects. "Age," for example, may be mea- 
sured by the number of years between birth and the present. 
"Prior record” may be measured by a “1” if there is a previous con- 
viction and a "0" if there is no previous conviction. "Attitudes 
toward water conservation" may be measured by a "3," "2," or "1"
depending, respectively, on whether a person answers "agree,"
"uncertain," or "disagree" to a survey question on the importance
of installing water-saving appliances.


[...] than Whites because Blacks [...] underestimated by the
commonly used tests. When the measurement error is random (or
“noise”), the measures, on the average, [...] but will be
inaccurate to varying [...]. That is, the measured IQ for [...]
measure of intelligence, but if the [...] number of times to each
person [...] however, that random measurement error can be very
damaging.
When the random measurement error is in the outcome vari- 
able(s) of interest, "noise" can obscure real treatment effects. 
That is, real results may be overlooked. When the random
measurement error is in the treatment (e.g., who got which
intervention) or the control variables (i.e., variables whose
effects need to be disentangled from the effects of the
treatment), estimates of the treatment effect can be
systematically too high or too low. That is, estimates of
treatment impact will be biased. Whether approached as an "errors
in variables" problem as in the econometric literature (e.g.,
Kmenta 1971: 309-22), as a "latent variable" problem as in the
psychometric literature (Lord 1980), or as
the “underadjustment” problem in the evaluation literature 
(e.g., Campbell and Erlebacher 1970), random error can lead to 
decidedly nonrandom distortions in evaluation results. The role 
of random measurement error is sometimes addressed through 
the concept of “reliability.” 
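
The claim that random error in a control variable can bias a
treatment estimate may be easier to see in a small simulation. The
sketch below is a hedged illustration, not a method taken from the
literature cited above: the variable names, the coefficients, and
the amount of noise are all assumptions. The simulated program has
a true effect of zero, and people with a higher value of a
background variable ("need") are more likely to be treated.

```python
import random


def ols_two_regressors(y, x1, x2):
    """Least-squares slopes (b1, b2) for y = a + b1*x1 + b2*x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(1)
n = 20_000
need = [rng.gauss(0, 1) for _ in range(n)]           # true control variable
treated = [1.0 if z + rng.gauss(0, 1) > 0 else 0.0 for z in need]
outcome = [2.0 * z + rng.gauss(0, 1) for z in need]  # true treatment effect: 0
need_measured = [z + rng.gauss(0, 1) for z in need]  # control with random error

effect_true_control, _ = ols_two_regressors(outcome, treated, need)
effect_noisy_control, _ = ols_two_regressors(outcome, treated, need_measured)

print(f"adjusting for the true control:  {effect_true_control:.2f}")
print(f"adjusting for the noisy control: {effect_noisy_control:.2f}")
```

With the accurately measured control, the estimated treatment
effect is close to its true value of zero. With the noisy control,
the adjustment is incomplete and a spurious positive "effect"
appears, even though the measurement error itself is purely random.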


Causality and Internal Validity 


Many evaluation questions concern causal relations, such as 
whether or not a proposed program encouraging people not to use 
wood-burning stoves on high air pollution days will “cause” reduc- 
tions in air pollution. The literature on causality and causal infer- 
ence is large and, currently, fraught with controversy (e.g., Pratt 
and Schlaifer 1984; Holland 1986; Holland and Rubin 1988; Berk 
1988b). Suffice it to say that by a “causal effect” we mean a
comparison between the outcome had the intervention been
introduced and the outcome had the intervention not been
introduced. For example, the causal effect of a ban on
diesel-powered automobiles might be the amount of nitrogen-based
pollutants in the air had diesel automobile engines been banned
compared to the amount had the ban not been put in place.

From the definition of a causal effect, it should be apparent 
that, in practice, causal effects cannot be directly observed. One 
cannot observe the amount of nitrogen-based pollutants in the 
air simultaneously with and without the ban on diesel engines
in place. Rather, causal effects must be inferred. Thus one might
try to estimate the causal effect of the ban by comparing air 
quality before the ban to air quality after. Or one might try to 
estimate the causal effect of the ban by comparing air quality in 
an area with the ban to air quality in an area without the ban. In 
the first case, however, one must assume that no other changes 
had occurred that could affect air quality in the interval 
between the earlier and later observational periods. In the sec- 
ond case, one must assume that the two areas are otherwise 
effectively identical on all factors that could influence air qual- 
ity. In short, the need to infer causal effects opens the door to 
inferential errors. 
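
A small numerical sketch may help show why the before-after
comparison depends on its assumption. All pollutant levels below
are hypothetical numbers invented for illustration; in a real
evaluation the "without ban" level after the ban date could never
be observed.

```python
# Hypothetical pollutant levels (arbitrary units), assumed for illustration.
level_before = 100.0
true_effect_of_ban = -20.0   # what the ban itself changes
weather_change = -15.0       # an unrelated change over the same interval

# Observed after the ban: both the ban and the weather have acted.
level_after_with_ban = level_before + true_effect_of_ban + weather_change

# Counterfactual: what would have been observed had there been no ban.
level_after_without_ban = level_before + weather_change

causal_effect = level_after_with_ban - level_after_without_ban
before_after_estimate = level_after_with_ban - level_before

print(f"causal effect:         {causal_effect:+.1f}")
print(f"before-after estimate: {before_after_estimate:+.1f}")
```

The before-after estimate equals the causal effect only when
nothing else changed over the interval. Here it overstates the
reduction by exactly the amount of the assumed weather change.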

In practice, therefore, whenever a causal relationship is
proposed, alternative explanations must be addressed and,
presumably, discarded. If such alternatives are not considered, one may
be led to make "spurious" causal inferences; the causal relation- 
ship being proposed may not in fact exist. Sometimes this con- 
cern with spurious causation is addressed under the heading of 
internal validity (Cook and Campbell 1979). For example, any- 
one who claims that an educational TV program improved the 
knowledge of those who viewed it must also consider the
alternative explanation that viewers were self-selected persons
interested in the topic who would have acquired the same
amount of information in some other way were the program not
available.


The consideration of alternative causal explanations for the
success of programs is an extremely important consideration
when plans to collect the data are formulated (Heckman and
Robb 1985). In the wood-burning example, an observed change
in air pollution after the program went into effect may have
been caused by milder weather, improved wood-burning
equipment, or a rise in cord wood prices leading people to shift
to other fuels. The social intervention could be totally
irrelevant.

In addition, programs that deal with humans are all more or
less subject to problems of self-selection; often persons who are
most likely to be helped, or who are already on the road to
recovery, are those most likely to participate in a program.
Thus vocational training offered to unemployed adults is
likely to attract those who would be most apt to improve their 
employment situation in any event. Or, sometimes, program 
operators "skim off the cream" among target populations for par- 
ticipation in programs, thereby assuring that such programs 
appear successful. In still other cases, events unconnected with 
the program produce changes that seem to result from the pro- 
gram being evaluated: An improvement in the speed with which 
cases are processed by a county's courts may seem to result from 
the addition of more prosecutors to the local district attorney's 
office, when actually, the improvement may have been caused 
by an unconnected change in plea-bargaining practices. In any 
case, we will have more to say about causal inference later. 
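
Self-selection can be illustrated with a simulation in which a
program does nothing at all. The sketch below rests entirely on
assumptions made for illustration: an unobserved "motivation"
score drives both enrollment and later employment gains, and the
comparison with coin-flip assignment is included only as a
contrast.

```python
import random
from statistics import mean


def mean_difference(outcomes, in_program):
    """Mean outcome of participants minus mean outcome of non-participants."""
    treated = [y for y, t in zip(outcomes, in_program) if t]
    untreated = [y for y, t in zip(outcomes, in_program) if not t]
    return mean(treated) - mean(untreated)


rng = random.Random(7)
n = 20_000
motivation = [rng.gauss(0, 1) for _ in range(n)]

# The hypothetical program has NO effect: gains depend on motivation only.
employment_gain = [m + rng.gauss(0, 1) for m in motivation]

# Self-selection: more motivated people are more likely to enroll.
self_selected = [m + rng.gauss(0, 1) > 0 for m in motivation]

# Contrast: enrollment decided by a coin flip, unrelated to motivation.
randomly_assigned = [rng.random() < 0.5 for _ in range(n)]

naive_estimate = mean_difference(employment_gain, self_selected)
randomized_estimate = mean_difference(employment_gain, randomly_assigned)

print(f"self-selected comparison:     {naive_estimate:.2f}")
print(f"randomly assigned comparison: {randomized_estimate:.2f}")
```

Under these assumptions the self-selected participants appear to
gain substantially from a program that has no effect, while the
coin-flip comparison stays near zero.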


Generalizability and External Validity 


Whatever the empirical conclusions resulting from evalua- 
tion research, it is necessary to consider how broadly one can 
generalize the findings in question; that is, are the findings rele- 
vant to other times, other subjects, similar programs, and other 
program sites? Sometimes such concerns are raised under the 
rubric of external validity (Cook and Campbell 1979). 

It cannot be overemphasized that, if findings cannot be gener- 
alized, they are useless. Policymakers need to know how interven- 
tions of certain kinds work and if those kinds of interventions 
are effective. Knowing how a particular program worked and how 
effective it was by itself has no value, because that program can 
never be exactly duplicated. The best that policymakers can do 
is mount a program that is (more or less) similar to the program 
evaluated. 

Consider, for instance, a program to reduce the consumption 
of electricity during the middle of the day (the “peak load" prob- 
lem) by raising the price of electricity between 10:00 in the 
morning and 4:00 in the afternoon. Suppose the evaluation con- 
vincingly showed that raising the price by 15% led to a drop of 
10% in electricity use during the peak load hours. However, the 
economic environment in which the intervention was introduced
is constantly changing, and this will affect not only the
base price of electricity on which the 15% increase may be
calculated but the fraction of each consumer's budget that is
allocated to electricity. For example, if the base price [...]
to the price of gas, consumers may [...]; it is far from obvious
what use policymakers could make of the evaluation unless one
grants some license to generalize.

The key, [...] victims may only be effective in urban areas where the location of
the safe house can far more easily be kept secret. Likewise, an 
affirmative action program that might be effective in virtually all 
private universities might fail at public universities. 

It is also common to wonder whether an evaluation's results 
would be applicable to persons who differ from the study's par- 
ticipants in abilities or in socioeconomic background. For 
example, Sesame Street was found to be effective for preschool 
children from lower socioeconomic families but more effective
for children from middle-class families (Cook et al. 1975). In 
contrast, arresting men who assault their wives seems to deter 
many future assaults regardless of the assailant's age, education, 
or race (Berk and Sherman 1988). The same issues arise, inciden- 
tally, for all kinds of experimental units such as households, 
police departments, and prisons. 

There is also the problem of generalizing over time. For exam- 
ple, Maynard and Murnane (1979) found that transfer payments 
provided by the Gary Income Maintenance Experiment appar- 
ently increased the reading scores of children from the 
experimental families. One possible explanation is that, with 
income subsidies, parents (especially in single-parent families) 
were able to work less and, therefore, spend more time with 
their children. Even if this were true, it raises the question of 
whether similar effects would be found at present, when infla- 
tion is taking a smaller bite out of the purchasing power of 
households. 

Finally, there is the difficulty of generalizing over
interventions, because no two treatments are likely to be identical.
Consider, for instance, the content of a literacy program for adults.
There are a wide variety of ways literacy may be taught and, 
within these forms, a wide variety of teaching styles, classroom 
arrangements, incentive systems, and teaching materials. Even 
with clear and lengthy guidelines, full standardization is impos- 
sible. Thus literacy programs integrated into more general voca- 
tional training may well have different results from literacy pro- 
grams taught on a stand-alone basis: One cannot generalize 
from one approach to another.

Another way of thinking about generalization is to recognize
that programs vary in their "robustness"; that is, in their ability 
to produce the same results with different operators, different 
clientele, in different settings, and at different historical times. 
Clearly, a "robust" program is highly desirable. For example, 
many medical interventions, such as vaccination programs for 
influenza, are relatively robust because, for purposes of fighting 
disease, medical treatments can often be effectively
standardized and humans tend to respond in a sufficiently
homogeneous manner.

It should be clear that external validity is a vital issue in all 
evaluations, which may be handled well or poorly. Basically, 
there are three devices that evaluators can employ to improve 
external validity. First, an unbiased sample of a defined
population (e.g., via a probability sample) justifies generalization
back to that population. Thus findings from a random sample of
students from a given high school may be generalized to all
students in that school. However, the sampling procedures do not
by themselves justify generalizations to students in other high
schools, even in the same school district.

Second, replications of a given evaluation may be used to 
incrementally define the boundaries within which generaliza- 
tion is possible. By “replications,” we mean new studies that are 
as similar as possible to the original study for which
generalization was problematic. Note that it is the study that is
being [...] may or may not be replicated. For example, an
experiment in Minneapolis showing that arrest [...] subsequent
violent behavior is [...] a day or more in jail (awaiting [...]
other areas leading to almost [...]ng.

Third, [...] theory or empirical generalizations may be
used for generalizing evaluation findings. For example,
microeconomic theory asserts that virtually all consumers will
respond to price increases by buying less of the particular com- 
modity. Hence, an evaluation in a single community showing 
that increasing the price of water leads to reduced residential 
water use may be widely generalized (Berk et al. 1981). Unfor- 
tunately, it is very rare in the social sciences to find theory that 
both is widely accepted and leads to broad generalizations. 


Chance and Statistical Conclusion Validity


The nature of chance in social phenomena has a long and 
controversial history, but for present purposes, chance plays a 
role whenever uncertainty exists. Basically, there are three 
(probably complementary) perspectives. First, uncertainty may
result from how the data were collected. Second, uncertainty
may derive from our ignorance about particular social
phenomena. Third, uncertainty may be an inherent part of all
social (and physical) phenomena. Each of these perspectives on
the role of chance will be considered below. 

Regardless of which of the three perspectives one favors, it is 
always important that the role of chance be properly taken into 
account. When formal, quantitative findings are considered, 
this is sometimes addressed under the heading of statistical 
conclusion validity (Cook and Campbell 1979), and the problem 
is whether "statistical inference" has been undertaken properly. 
Thus, just as flipping four heads in a row does not necessarily 
mean that a coin is biased (because a fair coin will produce four 
heads in a row once in a while), finding that students exposed to
a driver's education course have fewer accidents than those who
were not does not necessarily mean that the program was a
success. The difference in the number of accidents between
students who took a driver's education class and students who did
not may have been produced by a chance mechanism analogous 
to flipping a coin. Unless the role of such chance factors is 
assessed formally, it is impossible to determine if the program 
effects are real or illusory. 
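
One way to assess chance formally is sketched below. The accident
records are hypothetical numbers made up for illustration, and the
permutation test used here is a swapped-in technique chosen for
its simplicity, not a procedure named in the text. It asks how
often a difference at least as large as the observed one would
arise if group labels were assigned purely at random.

```python
import random
from statistics import mean


def prob_all_heads(flips: int, p_heads: float = 0.5) -> float:
    """Probability that every one of `flips` independent tosses is a head."""
    return p_heads ** flips


def permutation_p_value(treated, control, n_shuffles=5000, seed=0):
    """Share of random relabelings whose difference in means
    (control minus treated) is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean(control) - mean(treated)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = mean(pooled[n_t:]) - mean(pooled[:n_t])
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / n_shuffles


# Hypothetical data: 1 = had an accident, 0 = did not.
course_takers = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
non_takers = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]

four_heads = prob_all_heads(4)
p_value = permutation_p_value(course_takers, non_takers)

print(f"chance of four heads in a row from a fair coin: {four_heads}")
print(f"permutation p-value for the accident difference: {p_value:.3f}")
```

In this made-up example the course takers have fewer accidents,
but a gap that large arises in roughly a third of random
relabelings, so chance cannot be ruled out on these data alone.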

Similar issues concerning the operation of chance appear in
nonquantitative work as well, although formal assessments of
the role of chance are difficult to undertake in such studies.
Nevertheless, it is important to ask whether the [...]
findings rest on observed behavioral patterns that occurred with
sufficient frequency and stability to warrant the conclusion
that they are not "simply" the result of chance. Good
ethnographers often address the role of chance by collecting lots of
data, which allows an assessment of whether certain [...]
phenomena occur so often in particular ways that “the luck of
the draw" can implicitly be ruled out.

Having provided a brief taste of the issues, we can return to the 
three perspectives on chance. Consider first how evaluation data
may be collected. Sampling error occurs whenever one is trying
to make statements about some population of interest from observations
gathered on a subset of that population. For example, one
might be studying a sample of students from among those attending
a particular school, a sample of teachers from the population
of teachers in a particular school system, or even a sample of
schools from a population of schools within a city, county, or
state. Yet, although it is typically more economical to work with
samples, the process of sampling necessarily introduces the prospect
that any conclusions based on the sample may differ from
conclusions that might have been reached had the full population
been studied instead. Indeed, one can well imagine obtaining
different results from different subsets of the population.
Although any subset selected from a larger population may
be called a "sample," some subsets may be more useful than others.
The most useful samples are called probability samples, in which every element in a
population has a known nonzero chance of being selected (Sudman
1976; Kish 1965). Probability samples are difficult to execute
and are often quite expensive, especially when dealing with populations
that are difficult to locate. Yet there are such clear advantages
to such samples, as opposed to haphazard and potentially
biased methods of selecting subjects, that probability samples are 
almost always to be preferred over less rational methods. (See Sud- 
man 1976 for examples of relatively simple and inexpensive prob- 
ability sampling designs.) 
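
As a minimal sketch of what "a known nonzero chance of being selected" can look like in practice, the Python fragment below draws a simple random sample and a stratified sample and records each unit's selection probability. The function names, the two strata, and the sample sizes are hypothetical illustrations, not procedures taken from Sudman or Kish.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement; every unit has chance n / N."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    inclusion_probability = n / len(population)
    return sample, inclusion_probability


def stratified_sample(strata, sizes, seed=None):
    """Draw sizes[name] units from each stratum.

    Returns (unit, stratum, inclusion_probability) triples. Probabilities
    differ across strata but are known and nonzero for every unit.
    """
    rng = random.Random(seed)
    drawn = []
    for name, units in strata.items():
        probability = sizes[name] / len(units)
        for unit in rng.sample(units, sizes[name]):
            drawn.append((unit, name, probability))
    return drawn


# Hypothetical population: student IDs in one large and one small school.
strata = {
    "large_school": list(range(1000)),
    "small_school": list(range(1000, 1100)),
}
sizes = {"large_school": 50, "small_school": 20}

sample = stratified_sample(strata, sizes, seed=42)
# Weighting each sampled unit by 1 / probability estimates the population size.
estimated_population = sum(1 / p for _, _, p in sample)
print(len(sample), round(estimated_population))
```

Because every selection probability is known, each sampled unit can be weighted by its inverse; that is what separates such a design from a haphazard selection of subjects.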

Fortunately, when samples are drawn with probability proce- 
dures, disparities between statistics calculated from a sample 
and the respective population values can only result from the 
“luck of the draw,” and with the proper use of statistical infer- 
ence, one can place "confidence intervals" around estimates 
from probability samples, or ask whether a sample estimate 
differs in a "statistically significant" manner from an assumed 
population value. In the case of confidence intervals, one can 
obtain an assessment of how much “wiggle” there is likely to be 
in one’s sample estimates. In the case of significance tests, one 
can reach a decision about whether a sample statistic (e.g., a 
mean SAT score) differs from some assumed value in the population
(e.g., 600). For example, if the mean SAT score from a random
sample of students differs from some national norm, one
can determine if the disparities represent statistically signi- 
ficant differences, that is, differences large enough that they 
could not have occurred easily by chance alone. 
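
The two tools just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SAT scores are simulated rather than real, the national norm of 600 is the assumed value from the text, and a large-sample normal approximation stands in for whatever procedure a particular evaluation would actually use.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


def z_test_against_norm(sample, assumed_value):
    """Two-sided test of whether the population mean equals assumed_value."""
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    z = (statistics.fmean(sample) - assumed_value) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical random sample of 400 SAT scores (simulated, not real data).
rng = random.Random(0)
scores = [rng.gauss(620, 100) for _ in range(400)]

low, high = mean_confidence_interval(scores)
z, p = z_test_against_norm(scores, 600)
print(f"95% interval for the mean: {low:.1f} to {high:.1f}")
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The interval conveys the "wiggle" in the estimate; the p-value indicates how easily a disparity of that size could arise by the luck of the draw if the assumed value were correct.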

A second kind of chance factor associated with data collection 
stems from the process by which experimental subjects may be 
assigned to experimental and control groups. For example, it may 
turn out that the assignment process yields an experimental 
group that, on the average, contains brighter students than the 
control group. This may confound any genuine treatment effects 
with a priori differences between experimentals and controls; here
the impact of some positive treatment such as self-paced instruc- 
tion will be artificially enhanced because the experimentals were 
already performing better than the controls.

Much as in the case of random sampling, for experiments in
which the assignment to treatment group or control group is
undertaken with probability procedures, the role of chance can
be taken into account. In particular, it is possible to determine
the likelihood that outcome differences between experimentals
and controls are statistically significant. If the disparities are
statistically significant, chance (through the assignment process)
is eliminated as an explanation, and the evaluator can then
begin making substantive sense of the results. It is also possible
to place confidence intervals around estimates of the treatment
effect(s) indicating roughly the likely range of the effects, given
that any estimate is subject to random variation.
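
One way to take the assignment process into account is a randomization (permutation) test, sketched below in Python. It is offered as an illustration rather than as the procedure the authors have in mind: the function name, the test scores, and the number of reshuffles are all hypothetical.

```python
import random
import statistics


def randomization_test(treated, controls, n_shuffles=5000, seed=0):
    """Estimate how often chance assignment alone yields a difference in
    means at least as large as the one observed."""
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(controls)
    pooled = list(treated) + list(controls)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Re-run the "coin flips": reassign the same outcomes at random.
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[:n_treated])
                - statistics.fmean(pooled[n_treated:]))
        if abs(diff) >= abs(observed):
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles


# Hypothetical test scores for experimentals and controls.
experimentals = [78, 85, 90, 72, 88, 95, 81, 79, 92, 86]
controls = [70, 75, 82, 68, 77, 80, 73, 71, 84, 76]

difference, p_value = randomization_test(experimentals, controls)
print(f"observed difference = {difference:.1f}, p = {p_value:.4f}")
```

A small p-value means that few random reassignments reproduce a gap as large as the observed one, which is the sense in which chance "through the assignment process" is set aside as an explanation.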

Chance may enter one's data independent of how the data were
collected. Rather, it surfaces even if the total population of interest
is studied and no assignment process or sampling procedure is used.
Under one conception, only some of the influences on an outcome
may be understood and measured; the rest is the impact of chance.
In principle, then, any set of outcomes may
contain a significant chance component.

Under a second conception, chance may be an inherent property
of the social world (and the physical world more generally). The
details are well beyond the scope of this text, but the basic idea
is that social life is like the break in a game of eight ball.
Very small and seemingly insignificant differences in
where two balls make contact lead to very different
angles at which the balls separate.
That is, very small initial differences produce very large consequences.
And just as where the balls will stop after the break has
a very large element of uncertainty, so does social life. Note that 
random measurement error may be conceptualized in this fash- 
ion because measuring is itself a social process. 

Whichever conception one favors, one proceeds in practice 
with the assumption that, whatever the program processes at 
work, also at work will be forces that have some impact on out- 
comes of interest. Typically, these are viewed as a large number 
of small, random perturbations that on the average cancel one 
another. Thinking back to the test-taking example above, each 
neglected factor (e.g., amount of sleep the night before) 
introduces small amounts of variation in a child's performance, 
but the aggregate impact is taken to be zero on the average (i.e., 
their expected value is zero). Yet, because the aggregate impact 
is only zero on the average, the performance of particular stu- 
dents on particular days will be altered. Thus there will be 
chance variation in performance that needs to be taken into 
account. As before, one can apply tests for statistical signifi- 
cance or confidence intervals. One can still ask, for example, if 
some observed difference between experimentals and controls
is larger than might be expected from these chance factors 
and/or estimate the “wiggle” in experimental-control disparities. 
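
The working assumption in the paragraph above (many small perturbations that cancel on the average but not for any particular student on any particular day) can be illustrated with a short simulation. The sketch assumes, purely for illustration, 50 neglected factors per test sitting, each normally distributed around zero, and a hypothetical "true" score of 500.

```python
import random
import statistics


def observed_score(true_score, rng, n_factors=50, factor_sd=1.0):
    """True score plus the sum of many small zero-mean perturbations."""
    noise = sum(rng.gauss(0.0, factor_sd) for _ in range(n_factors))
    return true_score + noise


rng = random.Random(1)
true_score = 500.0  # hypothetical score in the absence of chance factors

scores = [observed_score(true_score, rng) for _ in range(2000)]
errors = [s - true_score for s in scores]

# On the average the perturbations cancel ...
print(f"mean error: {statistics.fmean(errors):.2f}")
# ... but individual performances still vary from sitting to sitting.
print(f"spread of errors (standard deviation): {statistics.stdev(errors):.2f}")
```

The mean error sits close to zero while the spread does not, which is why tests of statistical significance and confidence intervals are still needed.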

In case it is not clear, statistical conclusion validity speaks to 
the quality of inferential methods applied and not to whether 
some result is statistically significant. Statistical conclusion
validity may be high or low independent of judgments about
statistical significance. (For a more thorough discussion of these
and other issues of statistical inference in evaluation research, 
and statistical inference more generally, see Berk and Brewer 
1978; Barnett 1982; Pollard 1986.) 


Putting It All Together in a 
Research Design 


To briefly summarize our discussion so far, planning an
evaluation requires a number of decisions that will affect the
validity of the research. First, choices have to be made about how
the observed units (e.g., people, neighborhoods, schools) are
selected. Probability sampling is one example. Second, decisions
have to be made about how measurement will be undertaken.
For example, an arrest might be measured by an arrest
report filed by a police officer. Third, it is also essential to consider
how the treatment may be delivered. Random assignment
is one instance.⁵ Plans for undertaking these three tasks
(selecting the units, measurement, and delivering the intervention)
constitute the research design of an evaluation.

While the research design speaks to the validity of the research,
there are other planning decisions that affect the relevance of the
evaluation and whether the research design can be successfully
implemented. In the case of relevance, the intervention must
approximate as closely as possible the options in the policy space.
In addition, the outcome measures must reflect an outcome that
policymakers care about. If the goal of a program is to reduce
crime, for example, reducing arrests may or may not be a reasonable
proxy (given that many crimes are not reported and that
arrests are made for only a fraction of reported crimes). In the worst
of all possible worlds, a demonstrable program effect is dismissed
because it is the wrong


The Best Possible Strategy 


In the next chapters, the general issues just raised will be
addressed in more depth. Before proceeding, however, it is
important to stress that practical constraints may intervene in
the “real world" of evaluation research, even when an ideal marriage
is made between the evaluation questions posed and the
empirical techniques employed. Problems of cost, timeliness,
political feasibility, and other difficulties may prevent the ideal
from being realized. This in turn will require the development
of a "second best" evaluation package (or even third best), more
attuned to what is possible in practice. Yet, practical constraints 
do not in any way justify a dismissal of technical concerns; if 
anything, technical concerns become even more salient when 
less desirable evaluation procedures are employed. 


Notes 


1. Evaluations are also vulnerable because they rarely have the advantage of 
a thorough review by social scientists who were not connected to the project. 
Some argue that, in academic work, research results are typically scrutinized by 
“peer review” before publication. Important problems are often detected, therefore,
before the research is made public. Whether this is true, however, is open
to dispute. In any case, in part because of time constraints, evaluations are 
usually made public without the equivalent of a peer review. 

2. Another strategic advantage of attacking a study's methods is that one can 
capitalize on popular but naive notions of science. It is common for science to
be viewed as a fully objective activity that proceeds by certain hard-and-fast
rules. If the rules are not followed, the activity is not science. In fact, scientific
activities are some complex combination of rules, guidelines, intuition, habit,
and social pressure. However, if it can be shown that an evaluation failed to follow
some rule (e.g., the subjects were not a representative sample from some
designated population), its credibility among many policymakers can be seri- 
ously jeopardized. 

3. While “nothing” may be one of the options (serving as a comparison 
group), it cannot be overemphasized that nothing is not nothing (pardon our 
Zen). At the very least, “nothing” is likely to be the status quo. Moreover, subjects
exposed to the status quo may react in a variety of ways (e.g., resentment,
depression) if they know that others have been exposed to some innovative intervention.
In this instance, the status quo becomes a treatment in the conventional
sense; it does something new to subjects.

4. It is important to understand that the issues outlined under "chance" apply 
to all varieties of evaluation research, whether quantitative in approach or 
qualitative. However, the methods for dealing with the role of chance are more
thoroughly and explicitly developed for quantitative methods. 

5. We will have a lot more to say about random assignment later. However, 
the basic idea is that, if subjects are assigned to experimental and control conditions
by the equivalent of a flip of a coin, the experimental and control groups
will be, on the average, comparable before the treatment is introduced. This 
allows for a fair (unbiased) test of the intervention’s impact unconfounded with 
preexisting differences between the experimental and control groups. 


3 


Designing New Programs: 
A Chronological Perspective 


The Basic Questions 


Virtually all evaluation research begins with one or more pol- 
icy questions in search of answers. Such questions may include 
how widespread a social problem is, whether any program can 
be enacted that will ameliorate a problem, whether an existing
program is effective, whether an existing program is producing 
enough benefits to justify its cost, and so on. The following 
chronological sequence is implied: 


1. identification of policy issues;
2. formulation of policy responses;
3. design of programs;
4. improvement of programs;
5. assessment of program impact; and
6. determination of cost-effectiveness.


In practice, sometimes not all six activities are addressed, 
often with good reason. For example, an evaluation of an ongoing
social program such as Social Security might properly begin
with the fourth question: "improvement of programs." Less frequently,
the questions are addressed in another chronological
order. For example, a decision might be made on political
grounds to change the Social Security system. Then, the precise
nature of those changes would have to be delineated after an
empirical analysis of whose needs are not being properly met.
However, the six activities provide an initial conceptual framework
for what lies ahead.


Fitting the Evaluation 
Strategy to the Problem 


Each of the questions raised by a particular evaluation may be
tackled at levels varying in intensity and thoroughness. When
great precision is needed and ample resources are available, the
most powerful evaluation procedures may be employed. When
the occasion demands approximate answers or when resources
are in short supply, "rough-and-ready" (and, usually, speedier)
procedures can be used. Correspondingly, the answers supplied
vary in quality: The findings of some evaluations are more credible
than others, but all genuine evaluations produce findings
that are better than speculation. They are also likely to produce 
better findings than conventional wisdom, especially if the wis- 
dom is ideologically congenial. For example, it is commonly 
believed that the death penalty deters would-be murderers 
despite study after study failing to find any deterrent effects. 

This does not 


able. Rather, they should use the best possible procedures, given


best we could do under the circumstances,” when, in fact, technically
superior (and often less wasteful) procedures easily could

research must draw on a variety of perspectives and on a pool of
heterogeneous procedures. Thus approaches that might be use- 
ful for determining what activities were actually undertaken 
under some educational program, for instance, might not be 
appropriate when the time comes to determine whether the pro- 
gram was worth the money spent. Similarly, techniques that 
may be effective in documenting how a program is functioning 
on a day-to-day basis may prove inadequate for the task of assessing
the program's ultimate impact.

The choice among evaluation methods depends in the first 
place on the particular question posed; appropriate evaluation 
techniques must be explicitly linked to each distinct policy 
question. While this point may seem simple enough, it has been 
overlooked far too often, resulting in a forced fit between an 
evaluator's preferred method and particular questions at hand. 
Another result is an evaluation research literature padded with 
empty, sectarian debates between warring camps of "true 
believers." For example, there has been a long and somewhat 
tedious controversy about whether assessments of the impact of 
social programs are best undertaken with research designs in
which subjects are randomly assigned to experimental and con- 
trol groups or through theoretically derived causal models of 
how the program works. In fact, the two approaches are com- 
plementary and can be effectively wedded (e.g., Rossi, Berk, and 
Lenihan 1980; Heckman and Robb 1985). 

In the second place, the choice among evaluation methods is 
conditioned by the resources available and by the amount of precision
needed. For example, independent of available resources,
a sample of elderly individuals as small as 300 may be sufficient 
to establish that a significant number of senior citizens have 
incomes below the poverty line. That is, the sample is large 
enough to document the existence of a social problem. However, 
if the program design requires a precise estimate of how many 
such individuals there are, a sample of several thousand may be 
needed. Likewise, it is important to consider the ratio of pro- 
gram costs to evaluation costs. Devoting more resources to an 
evaluation than to the program being evaluated is often overkill
and/or political suicide. Nor does it make sense to plan an
evaluation that will take several years to complete when the
answers it will supply are needed within a few weeks.

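
The contrast drawn above between a sample of 300 and a sample of several thousand can be made concrete with the usual large-sample formula for a proportion. The sketch below assumes simple random sampling and a hypothetical poverty rate of 15 percent among the elderly; both are illustrative assumptions, not figures from the text.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Half-width of a normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(p, margin, confidence=0.95):
    """Smallest n whose margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


assumed_poverty_rate = 0.15  # hypothetical value, for illustration only

rough = margin_of_error(assumed_poverty_rate, 300)
precise_n = required_sample_size(assumed_poverty_rate, 0.01)

print(f"n = 300 gives roughly +/- {rough * 100:.1f} percentage points")
print(f"+/- 1 percentage point needs about {precise_n} respondents")
```

Under these assumptions, 300 respondents are enough to show that a sizable share of the elderly is poor, whereas pinning the share down to within a percentage point takes a sample in the thousands.
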
Finally, evaluations need to be tailored to the degree of importance
of the issue under scrutiny. At one extreme, proposals
for potentially low-impact programs probably do not need to
be evaluated with any degree of care. For example, it makes
very little substantive difference whether soft steel paper clips
are superior (or inferior) to plastic paper clips. Hence, it is not
worthwhile investing many (if any) resources in evaluating their
comparative merits. Yet, such judgments often are not straightforward.
While the two kinds of paper clips, for example, may
perform similarly, they may have very different
consequences. A lot would depend upon the ways in which the
two kinds of clip

them when they are discarded. And, of course, the issue might
be extremely salient

In contrast, policies dealing with central issues and programs
that are very expensive usually deserve the most careful evaluation
possible. Thus efforts to restrict the use of chlorofluorocarbons

enormous crop damage. In short, it would be foolish to settle for
anything less than the very best.


The Policy Contexts of Evaluation 


The six chronological activities listed
a richer and broader


raised about the nature and amount of
whether appropriate policy actions can be
programs that may be proposed are appropriate and effective. In
other words, the first context looks to the future and what might 
be done. Examination of existing policies and programs is the
second, in which attention is directed toward whether extant 
policies are appropriate and whether current programs have had 
their intended effects. Thus the second context reviews the past 
to inform the future. 

Although these two broad contexts, like the earlier six activi- 
ties, may be regarded as sequential, it often happens that the 
unfolding policy process may bypass earlier steps. Many major 
programs have truncated policy formation stages, going straight 
from the drawing boards of executive agencies or legislatures to 
full-scale operation. For example, the Head Start and school 
lunch programs were launched with little program testing 
beforehand. The issue of whether Head Start was truly effective 
did not surface until some years after the program had been in 
place. Similarly, many programs never get beyond the testing 
stage, because of demonstrated ineffectiveness, political opposition
(e.g., contract learning: Gramlich and Koshel 1975), or
changes in the policy space (e.g., the negative income tax proposals:
Rossi and Lyall 1974).


Looking to the Future: 
Some Steps in Policy and 
Program Formulation 


Some Background 


Proposals for policy changes and new programs presumably 
arise out of dissatisfaction with the status quo. Sometimes, 
existing policies and/or programs are not performing as hoped or 
the problem they were designed to address has changed (or was 
misread). For example, hardly a week goes by without visible 
dissatisfaction being expressed about the ways public policy and 
programs are responding to the "drug problem." Sometimes, new
problems arise that were previously unaddressed. Thus current 
congressional concern about the "greenhouse effect" effectively 
is new, although scientists have been studying the problem for
decades. Ideally, scientific information may be brought to bear
on both the nature of the social problem and the potential
programmatic responses.

It is important that the previous paragraph not be misunderstood.
In particular, we are not implying that the responses
offered by policymakers will necessarily confront the
problem in some objective sense. The "real" problem may be far
from obvious and certain responses may be immediately dismissed as
impractical or politically unpalatable. For example, is the “real”
problem with narcotics the large number of people who are
addicted or the current policy that criminalizes the use of narcotics?
Some argue that criminalization leads to a variety of


hard to argue against providing the best possible data on potential
areas of need, there is no


In an analysis of pending legislation
designed to reduce adolescent pregnancy, the General Accounting
Office (GAO 1986) found that none of the legislation


too cheaply. The problem, it seems,

programs. A proposed improvement is nothing but a proposed
change. Correspondingly, evaluation procedures applicable to 
entirely new policies and programs are suitable for proposed 
changes in existing policies and programs. Therefore, the discussion
that follows does not distinguish between them.


Stage 1: 
Defining the Problem 


A social problem is a social construction. That is, a condition 
defined as problematic becomes a problem. Moreover, the par- 
ticular manner in which the social problems are articulated 
may have dramatic effects on the kinds of remedies that are sug- 
gested. For example, two contending legislative proposals may 
each address the needs of homeless persons, one identifying the
homeless as low-income individuals who have no kin upon
whom to be dependent, and the other defining homelessness as
the lack of access to conventional shelter. The first definition
centers attention on social isolation, while the second concentrates
on the availability of affordable housing. It is likely that
the ameliorative actions that follow will be different as well. 
The first might emphasize a program to reconcile estranged 
individuals with their relatives, while the second might imply 
a subsidized housing program. 

To pursue another example, the presence of hazardous substances
in water supplies may be defined either as a use problem
or as a production problem. In the first instance, appropriate programs
might emphasize how best to educate users about avoiding
contaminated water or about purifying water before consumption.
The second definition might lead to surveillance of
potential polluters and sanctions for violating local pollution
ordinances. Note that these two definitions are not contradictory;
rather, each highlights an aspect of the problem.

The construction of social problem definitions is, of course,
not a task for which evaluators are uniquely trained. Lawyers,
judges, staff in administrative agencies, and substantive specialists
of various kinds (e.g., hydrologists in the case of water supply)
are also trained in how to think about social problems. In
addition, there is often a large number of "lay" experts, who are
sometimes more sophisticated than the credentialed experts.
Yet, there is a special role that evaluators can play in this portion
of the evaluation process; they can help all parties think through
the substantive and methodological implications of different
social problem definitions. For example, it is clear that the two
definitions of water pollution given above focus on
different (albeit overlapping) phenomena, but they also provide
clues about the underlying causal factors. Moreover, while the
first definition leads to a widespread educational effort directed at the
population at large, the second suggests narrowly focused oversight
efforts directed at business firms and municipalities. The
former may be more expensive, but also more palatable politically.
Yet, it might be easier to monitor the more focused intervention
and estimate program impact, perhaps because data are
already routinely collected on discharges into local rivers and
lakes. In short, judgments about definitional issues often
require substantive and methodological knowledge that evaluators
often have (or can easily get).

The evaluator can also play an important role by raising for
discussion the fit between popular conceptions of the problem
and the implicit or explicit definitions included in the legislative
or administrative

experienced by married teenagers. Likewise, an affirmative action program
for graduate training in the natural sciences may incor-
rectly assume that the pool of minority undergraduates properly 
prepared for graduate work in the natural sciences is as large as 
the comparable pool of White undergraduates. 


Stage 2:
Where Is the Problem and
How Big Is It?


Needs Assessments 


Proper design of a public program and projection of its costs 
require good information on the density, distribution, and over- 
all size of the problem in question. For example, in providing 
financial support for emergency shelters for homeless persons, 
it would make a very significant difference if the total homeless 
population is approximately 3.5 million or approximately
350,000 (both estimates have been advanced). It would also
make a big difference whether the problem was located primar- 
ily in central cities or whether it can be found in equal densities 
in small and large places. 

An identified problem often is a complex mix of related con- 
ditions; planning requires information on that complexity. 
Retaining the example of homelessness, the proportions of the 
homeless suffering from chronic mental illness, chronic alco- 
holism, or physical disabilities need to be known in order to 
design an appropriate mix of interventions. 

It is much easier to identify and define a problem than to 
develop valid estimates of its density and distribution. For 
example, a handful of battered children may be enough to estab- 
lish that a problem of child abuse exists. However, to know how 
much of a problem exists and where it is located geographically 
and socially involves detailed knowledge about the population 
of abused children and their distribution throughout the politi- 
cal jurisdiction in question. Such exact knowledge is ordinarily 
much more difficult to obtain.

Through their knowledge of the existing literature (consisting
of government reports, published and unpublished studies,
and limited-distribution reports), and their understanding of
which designs and methods lead to credible results, evaluation
researchers are in a good position to collate and assess whatever
information exists on the issues in question. Equal emphasis is
given in the last sentence to both “collate” and "assess": Unevaluated
information can often be as bad as no information at all.

For some issues, existing data sources may be of sufficient
quality to be used with confidence. For example, information
that is routinely collected either by the Current Population Survey
or the decennial census is likely to be of adequate quality.
Likewise, data available in many of the statistical series routinely
collected by federal agencies are often trustworthy. But
when data from other sources are used, it is always necessary to
carefully examine how the data were collected. The assessment
of data quality is again a task for which evaluators are eminently
qualified.

A good rule of thumb is that existing data sources will provide
contradictory estimates
Sometimes be reduce 
data on the same to 


ple, both the Coalit 
Rifle Association h 




the level of popular knowledge concerning how such substances 
can be safely deployed. Any instance of household pesticide mis- 
use constitutes a problem, but how serious the problem is for, 
say, households with children present, may be unclear. More- 
over, the precise content of the problem may be obscure. Perhaps 
households lack knowledge about the toxic properties of certain 
pesticides or, alternatively, they lack knowledge about other 
ways to control household or garden pests. Ordinarily there are
no data sources from which information on such issues can be
obtained. Under these circumstances, an evaluator may wish to
undertake a preliminary study to estimate the amount and distribution
of household pesticide use and knowledge about pesticides’
toxic properties.

There are several ways of making such estimates of “need.” 
Perhaps the easiest to undertake, but also the least reliable, is to 
collect “expert” testimony. Most of the larger estimates of the 
size of the homeless population are essentially compilations of 
local "experts" guesses of the numbers of homeless in their 
localities. (See U.S. Conference of Mayors 1987.) Another infor- 
mation source that can be reliable, but is often unavailable, is 
records from organizations that provide services to the popula- 
tion in question. For example, the extent of drug abuse may be 
extrapolated from the records of persons treated in drug-abuse 
clinics. Insofar as the drug-using community is fully covered
by existing clinics, such data may be quite accurate.

In many cases, it may be necessary to undertake quite elabo- 
rate research in order to assess the extent and amount of some 
problem. To illustrate, the Robert Wood Johnson Foundation 
and the Pew Memorial Trust were trying to plan a program for 
increasing the access of homeless persons to medical care. 
Although there was ample evidence that serious medical condi- 
tions existed among the homeless populations in urban centers, 
there was virtually no precise information on either the size of 
the homeless population or the extent of the medical problems 
in that population. 

Hence, the foundations funded a research project to devise 
technical advances needed in sample survey methods to collect
the missing information. The result was a study that influenced
most of the subsequent research on homelessness and has led to
changes in plans for the 1990 Census, making it possible to
arrive at reasonable estimates of the homeless population on a
national basis (Rossi, Fisher, and Willis 1986).

Needs assessment research is usually not as elaborate as the 
pilot research described above. In many cases, straightforward 
sample surveys can provide most of the information necessary. 
For example, in planning for educational campaigns to increase 
public understanding of the risks associated with hazardous 
substances, it would be necessary to have a good understanding
of what the current level of public knowledge is and which
population subgroups pose special problems. A national sample
survey would provide the necessary information.

The number of local needs assessments covering single
municipalities, towns, or counties done every year must now be
in the thousands. For example, the 1974 Community Mental
Health legislation called for community mental health needs
assessments to be undertaken periodically. The 1987 McKinney
Act, mandating aid to the homeless, called for states and local


planning programs for the homeless. And social impact statements
to be prepared in advance of large-scale alterations in the



may also be instructive, especially in getting detailed knowl- 
edge of the specific nature of the needs in question. For example, 
the development of educational campaigns may be considerably 
aided by qualitative data on the structure of popular beliefs. 
What, for instance, are the trade-offs people believe exist 
between the pleasures of cigarette smoking and the resulting 
health risks? 

An especially attractive feature of qualitative approaches is 
that they are sometimes inexpensive. Certainly, conducting 
three or four focus group sessions is cheaper than conducting 
the usual sample survey. Such groups may be especially
instructive if group members are unusually knowledgeable
"informants" who have access to information that ordinary citizens
would not. For example, problems that an emergency room
might be having serving large numbers of low-income patients
might be best articulated by emergency room doctors and nurses
and by key administrators in that hospital. A haphazard
cross-section of citizens would have little concrete information to
offer.

However, qualitative approaches can be very expensive if they 
mean placing several researchers in the field for a number of 
months. For example, a study of the job training needs of low- 
income, single parents (primarily women) might require two 
ethnographers and six months of fieldwork. The cost of the
project would then be one person-year of effort from two highly
trained anthropologists plus their research expenses (including
travel, food, and lodging). The total bill could easily top 
$100,000. 

Although needs assessment research is ordinarily under- 
taken for the descriptive purpose of developing accurate esti- 
mates of the amounts and distribution of a given problem, needs 
assessments can also yield some understanding of the underly- 
ing mechanisms. For example, a search for information on how 
many high school students study a non-English language may 
reveal that many schools do not offer such courses; part of the 
problem is that opportunities to learn foreign languages are 
insufficient. Or the fact that many primary school children of
low socioeconomic backgrounds appear to be tired and listless
in class may be explained by a finding that many ate no
breakfast.

Carefully and sensitively conducted qualitative studies are
particularly important for uncovering process information of
this sort. Thus ethnographic studies of disciplinary problems
within high schools may suggest why some schools have fewer
disciplinary problems than others, in addition to providing
some indication of how widespread disciplinary problems are.
The findings on why schools differ might suggest useful ways in
which new programs could be designed. Or qualitative research
on household energy consumption may reveal that few residents
had any information on the energy-consumption characteristics
of their appliances. Not knowing how they consume
energy, household members cannot develop efficient strategies
for reducing consumption.

Indeed, the history of ups and downs of public concern about
social problems provides many examples of how qualitative
studies (e.g., Lewis 1965; Liebow 1967; Riis 1890; Carson 1955),
and sometimes novels (e.g., Sinclair 1906; Steinbeck 1939), have
raised public consciousness about particular social problems.
Sometimes the works in question are skillful combinations of
qualitative and quantitative information, as in the case of
Harrington (1962), whose The Other America contained much
publicly available data interlaced with graphic descriptions of the
living conditions endured by the poor.

Finally, for program planning purposes, it is often important
to be able to project current circumstances into the future. [...]
problem that is serious at [...] less serious years later. [...]
quite risky, especially as the time horizon lengthens. There are
a number of technical and practical difficulties, which derive [...]
the population. However, had demographers in central Africa
made such a forecast 10 years ago, they would have been sub- 
stantially off the mark. They would have failed to anticipate the 
tragic impact of the AIDS epidemic, which is most prevalent 
among young adults. Projections with longer time horizons 
would have been even more problematic because trends in fertil- 
ity as well as mortality would have to have been included. 

We are not arguing against forecasting. Rather, we are con- 
cerned about uncritical acceptance of forecasts without a thor- 
ough examination of how the forecasts were produced. Examin- 
ing the forecasting assumptions, for example, is a task that can 
range considerably in complexity. For simple extrapolations of 
existing trends, the assumptions may be relatively few and eas- 
ily ascertained. But, even if the assumptions are known, it is 
often unclear how to determine if the assumptions are reason- 
ably met. For projections developed from multiple-equation 
computer-based models, examining the assumptions may
require the skills of an advanced programmer and the insight of 
a sophisticated statistician. In any case, all forecasts should be 
reported as both point and interval estimates. The former is typi- 
cally a single “best” guess, while the latter is a range of values in 
which the true (future) value likely lies. Yet, for a large number 
of forecasting models, it is not apparent how a proper confidence 
interval may be constructed. 
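
For the simplest case mentioned above, a straight-line extrapolation of an existing trend, the point and interval estimates can be sketched as follows. This is only an illustration under stated assumptions: the function name linear_trend_forecast is hypothetical, the trend is fitted by ordinary least squares, the errors are assumed independent with constant variance, and a normal quantile is used in place of the t quantile, which makes the interval somewhat too narrow for short series.

```python
import math
from statistics import NormalDist, mean


def linear_trend_forecast(years, values, target_year, level=0.95):
    """Return (point, lower, upper) for a least-squares linear trend.

    The interval is a prediction interval for one future value. It is only
    as good as the assumptions: a linear trend that continues unchanged,
    independent errors, and constant error variance.
    """
    n = len(years)
    if n < 3:
        raise ValueError("need at least three observations")
    x_bar, y_bar = mean(years), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual standard error with n - 2 degrees of freedom.
    residuals = [y - (intercept + slope * x) for x, y in zip(years, values)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))

    point = intercept + slope * target_year
    # The standard error grows as the target year moves away from the data.
    se_pred = s * math.sqrt(1 + 1 / n + (target_year - x_bar) ** 2 / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation
    return point, point - z * se_pred, point + z * se_pred


# Hypothetical series, for illustration only.
YEARS = [1980, 1981, 1982, 1983, 1984, 1985]
VALUES = [100, 104, 109, 115, 118, 124]
```

Calling linear_trend_forecast(YEARS, VALUES, 1986) and then again with 1990 shows the interval widening as the time horizon lengthens, which is the risk described in the text. Nothing in the interval protects against a break in the trend itself, such as the one in the central Africa example.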


Stage 3: 
Can We Do Anything 
About the Problems? 


Problem-Driven Research 


Diagnosis may be the first step on the road to treatment. The
second step is understanding enough about the problem and its
setting to devise appropriate remedies. That is, knowing a lot
about the distribution and extent of a problem does not by itself
lead automatically to solutions. In order to design programs, one
must call on two sorts of knowledge. First, one needs valid
knowledge on the leverage points and interventions useful for
changing the distribution and extent of a problem. Second, one
needs to know from a variety of sources something about the
institutional arrangements that are implicated so that workable
policies and programs can be designed.
For example, applied research in microeconomics has shown
repeatedly that consumers typically will respond to price. Other
things being equal, they will generally buy less of a commodity
if its price increases. This lesson can be applied to conservation
of all sorts. Yet, it has been virtually impossible in [...] states
to institute marginal cost pricing for water because of political
opposition from large agricultural users, who, under existing
schemes, are being subsidized by residential and industrial
users (Berk et al. 1981).

To take another illustration from water conservation, applied
research in social psychology indicates that people who are
likely to conserve believe that others drawing on the [...]
resource are conserving as well. Yet, it is unclear how water
consumers who believe that other consumers typically are not
conserving can be convinced that they are not alone in their [...]
for conservation efforts. The only consumption data they
typically see are their own (on their bill). One strategy employed
by some water districts in California has been to enclose in each
consumer bill a short newsletter reporting aggregate [...]
consumption for important segments of the community (Berk
et al. 1981).

It cannot be overemphasized that, to construct a program
likely to be adopted by an organization, one needs to know how
to introduce new procedures that would be undertaken [...]
large-scale organizations: schools, factories, [...]
individual teachers. In short, inadequate attention to the [...]
[...]pared, and/or lacking in the necessary skills is a sure recipe
for degraded interventions. Indeed, under such circumstances, it is
possible that no programs at all will be delivered.


Stage 4: 
Developing Promising Ideas into 
Promising Programs 


Moving from a conception of what may be done to a hypothet- 
ical program is the next step, and the act of transforming promis- 
ing ideas into a set of concrete activities is essentially the prac- 
tice of art rather than science. Moreover, because the knowledge 
required is primarily substantive, evaluators have no clear or 
necessary role. However, evaluators are more likely to make 
important contributions in the translation of ideas to programs 
insofar as they have a good understanding of the workings of 
similarly conceived past programs and of the capabilities of
organizations likely to implement the program in question. 

As briefly described earlier, for example, during the severe 
energy crisis of the late 1970s, needs assessments revealed that 
consumers had little specific knowledge of how their use of
electrical appliances affected energy consumption. Of course,
nearly every consumer knew that keeping refrigerator doors
closed saved electricity and that turning off electrical burners
when not being used for cooking would lower electricity
consumption. However, few knew that there was wide variation in
the energy used by different brands of refrigerators and electrical
stoves. Needs assessment research also showed that most
consumers were quite concerned about energy costs. In short, there
was a reservoir of motivation to adopt energy conservation
measures and substantial gaps in popular knowledge about how best
to conserve.

Given these circumstances, there were a variety of programs 
that could have been developed, some resting on pricing changes 
that would have rewarded consumers for using appliances less 
during high-demand periods of the day and others based on 
educational efforts urging consumers to lower their thermostat
settings. Furthermore, within each of these broad categories
of programs, there were a variety of specific measures. Pricing
schemes, for instance, might be built on the marginal price, the
average price, increasing block pricing, and so on. And any
pricing scheme based on units of consumption could proceed only
if energy consumption could be accurately measured (e.g., by
meters). Ideally, the energy consumption for different
appliances should be monitored so that, in principle, consumers
could determine which appliances were especially inefficient
(e.g., toasters) or inappropriately used (e.g., ovens used for
heating a room). As a compromise, perhaps energy use could be
metered by room. And, finally, means would be needed to
inform consumers about their energy consumption in ways that
effectively communicated the consequences of how they used
appliances; rapid and accurate feedback would be an essential
part of the program. Ideally, appliance-by-appliance breakdowns
should be provided.
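
The difference between the marginal price and the average price under increasing block pricing can be made concrete with a short sketch. Everything specific here is an assumption made for illustration: the function name block_tariff_bill, the block limits, and the prices per kilowatt-hour are hypothetical and do not describe any actual utility's rate schedule.

```python
def block_tariff_bill(kwh, blocks):
    """Compute a bill under increasing block pricing.

    blocks is a list of (upper_limit_kwh, price_per_kwh) pairs in ascending
    order; the last pair must use None as its limit (an open-ended block).
    Returns (total_bill, marginal_price, average_price).
    """
    if kwh < 0:
        raise ValueError("consumption cannot be negative")
    if not blocks or blocks[-1][0] is not None:
        raise ValueError("the last block must be open-ended (limit None)")

    total = 0.0
    lower = 0.0
    marginal = blocks[0][1]
    for upper, price in blocks:
        if upper is None or kwh <= upper:
            # Consumption ends inside this block.
            total += (kwh - lower) * price
            marginal = price
            break
        # The whole block is used up; move on to the next, pricier one.
        total += (upper - lower) * price
        lower = upper

    average = total / kwh if kwh > 0 else 0.0
    return total, marginal, average


# Hypothetical rate schedule, for illustration only.
EXAMPLE_BLOCKS = [(300, 0.08), (700, 0.12), (None, 0.20)]
```

With this schedule, a household using 500 kilowatt-hours pays 0.12 for its last unit but only about 0.096 per unit on average. A consumer who looks only at the total bill sees the average price, not the marginal one, which is one reason the design of the bill and of the feedback matters as much as the price schedule.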

The point is that programs are a set of activities undertaken
by individuals and organizations. Specifying these details is a
very long way from the broad ideas about possible interventions,
and it requires "nuts-and-bolts" knowledge of past programs and
current prospects. In the energy consumption example above, an
evaluator would ideally know a lot about a large number of
earlier conservation programs and about the day-to-day
functioning of the local utility company. Note, however, that such
knowledge is primarily substantive and hardly the sole preserve
of evaluation researchers, [...]
instituted and whether there is any evidence that they might
work. Thus, if consumers are to pay a higher per-unit price as the 
amount consumed increases (e.g., under increasing block pric- 
ing), there must be some assurance that consumers understand 
their electric bills and the links between how they use appli- 
ances and how much electricity is consumed. These are the 
kinds of tasks by which evaluators earn their keep. 

Likewise, evaluation skills per se are not especially relevant 
for the translation of broad conceptions of educational
television programs into the scripts, lighting, directing, filming, and
editing of Mr. Rogers' Neighborhood. However, evaluators can
play an essential role in the pilot testing (pretesting) of such TV
programs. For example, in some circles, educational programs
must demonstrate that they can get the attention of their
intended audience, be understood by them, and produce a
predisposition to act in a desired fashion. Thus pilot versions of
new programs are often tested on small audiences whose 
responses are carefully monitored. Elements of a program that 
repel audiences, lead to misunderstanding, or lead to undesired 
behavior can be changed. Then, the program can be finely tuned 
until pretest audience responses are acceptable. 

In practice, useful pilot studies can fall short of full scientific 
rigor with, for instance, pretest TV audiences that are selected 
haphazardly. Or pilot studies can involve rigorous research pro- 
grams that would do a major university proud. Toward the less rig- 
orous side of the continuum, pretesting is routinely undertaken 
by the Children's Television Workshop, producers of Sesame 
Street. The producers employ volunteer pretest audiences of 
preschoolers to measure the attention-getting abilities of its TV 
episodes. The producers watch how closely pretest audiences fol- 
low the action of the episode being tested. In addition, the 
audience is interviewed after each showing to ascertain whether 
or not the message of the program was understood clearly. Program 
deficiencies are then rectified, and the process is then repeated 
until a program acceptable to the producers is finally achieved. 

At the other extreme, the Lodge Program developed by
Fairweather and his associates employed very rigorous pilot-testing
procedures. The goal was to return mental patients to
noninstitutionalized life in a way that would reduce the chances of
being rehospitalized. Drawing upon social science findings
about the importance of informal support within small groups, 
Fairweather and his colleagues took two decades to develop a 
technique that could be used by most mental hospitals and that 
was demonstrably effective in lowering the return rates. The 
development process consisted of a series of randomized field 
experiments in which version after version of the program was 
tested until an effective version was achieved. 

Thorough pretesting during the development phase can
increase the chances that a worthwhile program will emerge. But
it is one thing to have a program that works well with test
subjects and quite another to have a program that will work well with
real subjects. For example, a Sesame Street episode that does well
in a studio atmosphere has none of the competition for attention
that exists in an ordinary living room. Indeed, an adult-oriented
health information program, Feeling Good, that was developed
by the Children's Television Workshop did well with pretest
audiences but failed to achieve significant audience shares when
aired on public TV stations during prime viewing hours. The test
audiences in the studios liked the episodes they viewed, but the
unconstrained audience preferred programs on other channels
that were competing with Feeling Good.


Stage 5:
The YOAA Problem


Once a prospective program has been refined through pilot
studies, time comes to transport the program to a more realistic
operating environment. However, moving from the development
phase to the operational phase usually means shifting
responsibility from a research-oriented organization to an
operating agency. This leads to the "Can YOAA Do It?" problem: Can
"your ordinary American agency" carry out the program with
fidelity? Often the YOAA problem has been identified with the
character of large-scale bureaucracies, a diagnosis that obscures
as much as it illuminates. The issue is whether an operating
agency has the appropriately trained personnel, a sufficiently 
motivating reward system, and the resources to carry out a pro- 
gram at the desired level of fidelity. Asking an already overbur- 
dened agency to take on additional work, especially work for 
which its personnel are not trained, is clearly a recipe for failure. 
For example, an emergency room program for counseling crime 
victims, developed in a major Los Angeles hospital, was never 
implemented because social workers paid by the program were 
actually used to reduce shortages of social workers in the wards. 

In addition, even agencies with the requisite resources and 
skill may in practice “drop the ball” because of some legitimate 
confusion, incomplete communication, insufficient follow- 
through, or a host of effectively unpredictable difficulties. For 
example, an experimental program in Colorado Springs, 
Colorado, to test different policing strategies in domestic vio- 
lence incidents was at first poorly implemented because of a 
totally unrelated strike threatened by rank and file officers who
were seeking bargaining rights for their union. Fortunately,
program implementation improved dramatically when the issues
underlying the threatened strike were effectively resolved. 

Therefore, it is vital to study how programs are implemented, 
and descriptive accounts may be especially valuable. For exam- 
ple, just a few field visits to high schools that were supposed to 
have in operation a widely publicized program designed to raise 
the academic motivation levels of poor Black children revealed 
that the programs existed mainly on paper and in the public 
relations releases of the main sponsor (Murray 1980). Similarly,
careful observations at the sites of the celebrated Cities in
Schools Project brought to light that the projects as imple- 
mented fell far short of original designs and intentions (Murray 
1981). 

It is at this point that it may make some sense to initiate
demonstration programs in which operating agencies attempt
to implement the program. Demonstration programs can be
viewed as another developmental step when attention is
centered on the problems that operating agencies encounter
carrying out a program. A prime example is the "administrative
experiment" (a misnomer because these demonstrations were
not truly experiments) carried out in connection with the
proposed housing voucher program. Ten municipalities were
selected to work out procedures for administering housing
voucher programs in their localities and to carry them out over a
period of years. The demonstrations were closely monitored by
researchers, who carefully noted all the difficulties each of the
ten cities encountered in administering their versions of the
housing voucher program (Struyk and Bendick 1981).


Stage 6: 
Will a Particular Program Work? 


The Effectiveness Issue 


After a program has been fine-tuned and its kinks
ironed out through demonstrations, there still remains the
question of effectiveness. To this point, all one has managed
[...]dity. Second, it is often difficult to distin[...]
variation, which, as "noise," [...]
shaped by a large number of forces. Yet the programs introduced
rarely address more than one of these. Nutritional behavior, for 
example, is affected by upbringing, ethnic background, dispos- 
able income, local availability of food products, information 
about nutritional issues, subjective estimates of risks to health 
and well-being for the nutritional behavior in question, house- 
hold composition, the nutritional practices of family members 
and peers, chemical dependencies, and many other influences. 
Yet programs meant to improve nutrition rarely target more 
than one of the possible influences. To make matters worse, 
there appears to be no single developmental stage that, if inter- 
rupted, will improve nutritional practices effectively. In short, 
there are many ways to affect eating habits, but each by itself is 
a small piece of the picture.

When a promising program has been identified, and a 
reasonable working version developed, the next step is to see 
whether the program is effective enough to justify it becoming 
a routine part of some agency's activities. At this point, we
recommend the use of randomized experiments to test the
effectiveness of candidate programs. Later, a wider range of designs
will be discussed when we turn to evaluation of ongoing
programs. Because the alternatives to random assignment are
typically less desirable for causal inference, randomized
experiments should be the design of choice when random assignment
can be properly implemented. The desirable situation is far
more likely to exist when new programs are being developed
than when an ongoing program must be assessed, and so we will
consider "quasi-experiments" later.

Randomized experiments are desirable (some would say man- 
datory) because randomly allocating persons (or other units, such 
as classes) to an experimental group (to which the tested program 
is administered) or to a control group (from whom the program 
is withheld) assures that all the factors ordinarily affecting the
outcome in question are, on the average, distributed identically
across those who receive the program and those who do not. 

Therefore, randomization, on the average, prevents the
confounding of estimated treatment effects with the impact of
other factors that may affect the outcome. As a result, internal
validity is enhanced enormously, and the likelihood of reporting
spurious causal effects dramatically reduced.
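
The logic of random assignment can be illustrated with a small simulation. This is a sketch under invented assumptions, not a description of any actual experiment: the function names, the single "background" factor, the true effect of 5 points, and the noise levels are all hypothetical.

```python
import random
from statistics import mean


def randomize(units, rng):
    """Randomly split units into (experimental, control) groups."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(experimental_outcomes, control_outcomes):
    """Estimated program effect: mean outcome difference between groups."""
    return mean(experimental_outcomes) - mean(control_outcomes)


def simulate(n_units=2000, true_effect=5.0, seed=0):
    """Return (balance_gap, estimated_effect) for one simulated experiment."""
    rng = random.Random(seed)
    # Each unit carries a background factor that also drives the outcome
    # (for instance, prior skill). All values are invented.
    units = [{"id": i, "background": rng.gauss(50, 10)} for i in range(n_units)]
    experimental, control = randomize(units, rng)

    # After randomization the groups should look alike on the background
    # factor, on the average, so this gap should be close to zero.
    balance_gap = (mean(u["background"] for u in experimental)
                   - mean(u["background"] for u in control))

    def outcome(unit, treated):
        effect = true_effect if treated else 0.0
        return unit["background"] + effect + rng.gauss(0, 5)

    estimate = difference_in_means(
        [outcome(u, True) for u in experimental],
        [outcome(u, False) for u in control],
    )
    return balance_gap, estimate
```

Because assignment does not depend on the background factor, the simple difference in group means lands near the true effect without any statistical adjustment. If units were instead allowed to select themselves into the program on the basis of that factor, the same difference would mix the program's effect with the effect of the background factor.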

We advocate the use of randomized experiments at this stage 
in program development because of their scientific merit. (For 
other assets of randomized experiments, see Berk et al. 1985.) 
However, this commitment in no way undermines the com- 
plementary potential of qualitative approaches such as ethno- 
graphic studies, particularly to document why a particular inter- 
vention succeeds or fails. For example, in designing educational 
campaigns based on workshops, qualitative studies can uncover 
those organizations in which implementation may be most eas- 
ily achieved. For example, workshops held by employers after 
5:00 p.m. may appear to be an efficient strategy except that inter- 
views with employees could reveal that few would remain after 
hours for any purpose. Indeed, a similar program of proposed 
workshops to teach better health habits to persons at risk of 
coronary heart disease failed to attract more than a handful of 
participants. 

Developmental experiments should ordinarily be conducted
on a relatively modest scale and are most useful for policy when
they test a set of alternative programs that are intended to
achieve the same effects. For example, it might be useful for an
experiment to test several ways of motivating people to have
their homes tested for radon because the findings could be used
to provide information on the relative effectiveness of several
attractive (a priori) methods. Likewise, an experiment on a range
of policing strategies in domestic violence incidents (arrests,
restraining orders, crisis counseling, citations) would be more
instructive than an experiment that considers only two
alternatives.

[...] Department of Labor tested the extension of unemployment benefit
coverage to prisoners released from state prisons in a small
randomized experiment conducted in Baltimore (Lenihan 1976).
Randomized experiments have also been used to test national 
health insurance plans and direct cash subsidies for housing to 
poor families. 

Perhaps the most extended series of developmental experi- 
ments was undertaken by Fairweather and Tornatzky (1977), 
comprising over two decades of consistent refinement and 
retesting, and resulting in a replicable, effective treatment that 
could be implemented under a variety of conditions. In the same 
spirit, in three cities several extensive tests are currently under 
way designed to evaluate alternative ways of lowering the inci- 
dence of heart disease through improved nutrition. In the 
environmental area, six alternative approaches to communicat- 
ing information about radon were tested in New York (Smith et 
al. 1987). The Minneapolis Spouse Abuse Experiment, which 
tested three different policing strategies in domestic violence 
incidents, is being replicated in six new field experiments in six 
different cities (Berk and Sherman 1988). 

Given a program of proven effectiveness, the next question one 
might reasonably raise is whether the opportunity costs of the pro- 
gram are justified by the gains achieved. Or the same question 
might be more narrowly raised in a comparative framework: Is Pro- 
gram A more "efficient" than Program B, both otherwise equally 
acceptable alternate ways of achieving some particular goal? 

The main problem in answering such questions centers on
establishing a yardstick by which comparisons may be made. For
example, would it be more useful to divide the units of
achievement gained by dollars, the number of students covered, or the
number of classes served by the program? In fact, usually the most
convenient way to define efficiency is to calculate
cost-effectiveness: the number of dollars spent per unit of output. In
the case of Sesame Street, for example, two cost-effectiveness
measures were computed: (1) dollars spent per child-hour of
viewing, a measure of the cost of running the program; and (2) dollars
spent for each additional letter of the alphabet learned, a
cost-effectiveness measure taking into account increases in learning.
Note that the second measure implies knowing the impact of the
program, presumably from a formal impact assessment.
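
The two kinds of cost-effectiveness measure just described amount to the same simple ratio with different denominators. The sketch below uses invented figures; the function name and every number are hypothetical and are not taken from the Sesame Street evaluation or any other study.

```python
def cost_effectiveness(total_cost, units_of_output):
    """Dollars spent per unit of output (lower means more cost-effective)."""
    if units_of_output <= 0:
        raise ValueError("units_of_output must be positive")
    return total_cost / units_of_output


# Hypothetical figures, for illustration only.
PROGRAM_COST = 1_000_000.0
CHILD_HOURS_VIEWED = 250_000.0          # a measure of delivery
ADDITIONAL_LETTERS_LEARNED = 40_000.0   # a measure of impact

# (1) Cost of running the program per unit delivered.
cost_per_child_hour = cost_effectiveness(PROGRAM_COST, CHILD_HOURS_VIEWED)

# (2) Cost per unit of impact; the denominator must come from an impact
# assessment, not from program records alone.
cost_per_letter = cost_effectiveness(PROGRAM_COST, ADDITIONAL_LETTERS_LEARNED)
```

The first ratio can be computed from administrative records. The second cannot, because "additional" letters learned is a causal quantity: it is the difference between what participants learned and what they would have learned without the program.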

The most complicated way of addressing the efficiency ques- 
tion is to conduct a full-fledged benefit-cost analysis in which 
all of the values of all of the benefits and costs are computed. 
The ratio of benefits to costs is the benefit-cost ratio of the pro- 
gram. However, relatively few full-fledged benefit-cost analyses 
have been conducted for social programs because it is difficult to 
convert all the costs and all the benefits into the same metric. In 
principle, it is possible to convert into dollars all program costs 
and benefits. In practice, however, it is rarely possible to do so 
because of disagreements over the value of various program 
inputs and outputs. For example, it would be difficult to affix a 
dollar value to learning an additional letter of the alphabet. 

A second problem with full-fledged benefit-cost analyses is 
that they must consider the long-run consequences of the pro- 
gram in question and the long-run consequences for the next 
best alternative forgone. This immediately raises the question 
of how to value, in today’s dollars, future returns from some 
investment. The usual assumption is that current consumption
is worth more than future consumption (in part because
gratification is delayed), so that a dollar's worth of some
commodity consumed in the future is worth less than a dollar
today. Adjusting future values downward in this way is called
“discounting.”
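
A minimal sketch of discounting and of the benefit-cost ratio may help fix ideas. It assumes a constant annual discount rate and yearly amounts, which is a common simplification rather than anything prescribed here; the function names and the dollar streams are hypothetical.

```python
def present_value(amounts, rate):
    """Discount a stream of yearly amounts to today's dollars.

    amounts[t] is the amount arriving t years from now (t = 0 is today);
    rate is the annual discount rate, e.g. 0.05 for 5 percent.
    """
    if rate <= -1:
        raise ValueError("rate must be greater than -1")
    return sum(a / (1 + rate) ** t for t, a in enumerate(amounts))


def benefit_cost_ratio(benefits, costs, rate):
    """Ratio of discounted benefits to discounted costs."""
    return present_value(benefits, rate) / present_value(costs, rate)


# Hypothetical program: a cost of 1,000 today, then benefits of 600 in each
# of the following three years.
EXAMPLE_COSTS = [1000.0, 0.0, 0.0, 0.0]
EXAMPLE_BENEFITS = [0.0, 600.0, 600.0, 600.0]
```

With no discounting the example has a benefit-cost ratio of 1.8; at a 10 percent rate the ratio falls to roughly 1.49. The ranking of two programs whose benefits arrive on different schedules can therefore depend on the rate chosen, which is why the choice of rate is consequential.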

In the context of program evaluation, the future returns
(benefits minus costs) of alternative interventions need to be
compared after discounting. For example, an assessment of a
vocational training program in inner-city high schools needs to
consider (among many other things) [...]
program on students' earnings ove[...]
earnings of individuals receiving [...]
by what fraction should future e[...] discounted? This is
always a judgment call about which researchers routinely
disagree (Thompson 1980).

In short, complete benefit-cost analyses are typically
impractical. However, the attempt itself is often useful because it forces
policymakers to confront the painful fact that all social programs 
have opportunity costs in both the short and the long run. In addi- 
tion, phrasing program outcomes in cost-effectiveness terms is 
often a handy method for addressing the trade-offs between alter- 
native programs. 


Practical Developmental 
Evaluation Approaches 


If all of the research activities described in the preceding
pages were undertaken for each and every proposed program or
policy shift, the pace of change in American public programs
would be appreciably slowed. Thus, while one must admire the
devotion, care, and diligence of Fairweather and his colleagues,
when the Lodge approach had finally been perfected,
psychopharmacological developments and the community mental
health movement had so drastically changed the treatment of
mental health patients that the Lodge approach had become
largely irrelevant. While Fairweather and his associates
labored carefully and at great length to perfect the Lodge 
approach, the content of policy space had shifted to highlight 
other concerns about the treatment of the mentally ill. 

Clearly, practical approaches to program development have to 
take into account all the constraints on time and resources that 
are ordinarily confronted. Decades-long development efforts 
may be the "right" way, but the practical way must deliver the 
best possible information in a timely fashion. There are no hard- 
and-fast guidelines about how best to proceed, although a few 
broad principles may be stated. 

In general, judgments about whether to evaluate a program, 
and about how thorough that evaluation should be, should rest, 
at least in principle, on a rough benefit-cost ratio for the
proposed evaluation. That is, one must place some value on the
information that could be obtained under different evaluation
designs. Other things being equal, the greater the potential
impact of the proposed program, whether it succeeds or fails,
the more carefully it should be evaluated. This means that
programs that promise to be costly, that may have adverse and
widespread effects, or that deal with the central, gnawing problems
of society probably deserve the best possible evaluation. Pro- 
grams in which the consequences of an ineffective program are 
slight may typically warrant far less thorough attention. Indeed, 
there is no doubt that some programs need not be scrutinized at 
all. However, before proceeding, it is vital to factor in what kinds 
of evaluations are feasible, how credible their results are likely 
to be, and what each would cost to undertake. A very important 
program, such as Social Security, may be prohibitively expen- 
sive to evaluate persuasively. Alternatively, an evaluation of a 
community's efforts to reduce bicycle accidents, by instituting 
inspections of bicycles ridden to local schools, may produce lots 
of useful information per dollar of cost. 

Finally, the evaluative activities in support of program 
development have been described above as a set of procedures 
arrayed over time. We emphasize again that this need not be the 
case. A set of experiments conducted simultaneously on several 
alternative programs can reduce the total time needed to arrive 
at useful conclusions. Demonstrations of programs can be used 
for fine-tuning purposes. Randomized experiments may be for- 
gone when there are very strong indications of effectiveness 
from nonexperimental evidence. While there is an expositional 
logic to the chronology presented and a thoroughness that fol- 
lows when each stage is executed in the order proposed, we are 


offering no recipe. Evaluation practitioners in real time and on 
site will always have to make judgment calls. 


Notes 


1. In this case the "treatment" is an execution and the "control" is a very long
prison term.
2. However, one must carefully judge what is at stake. For example, while the
cost of an evaluation may loom large compared to the cost of the particular program
under consideration, the evaluation findings may have implications
for many more programs and for a larger program. In the context of the universe
of programs potentially affected, the evaluation budget may be relatively small.

3. In our experience, as the problem definition is being constructed, it is vital
for evaluators to make explicit what is going on: important options are being
foreclosed. Policymakers should be constantly reminded that there are opportunity
costs to their decisions. Otherwise, one risks having policymakers later dissociate
themselves from the evaluation, claiming that they were misled when
the evaluation was designed.

4. There are, unfortunately, exceptions. For example, it is widely acknowl- 
edged that the U.S. census undercounts the number of Blacks and Hispanics. For 
the nation as a whole, the undercount is relatively small and for most purposes 
it can be ignored. However, for some jurisdictions with large populations of 
Blacks and Hispanics, the undercount translates into substantial losses of federal
funds (because many programs are tied to the size of particular populations).
This led to a lawsuit by the State of New York in which statistical adjustments 
for the undercount have been proposed (Ericksen and Kadane 1985). In short, 
how good the data have to be always depends on how those data will be used. 

5. It is also the case that, if drug-abuse clinics did cover all or most of the 
drug-abusing population, drug-abuse treatment programs might not be an issue. 
Hence, to the extent that a problem is being adequately addressed by existing 
programs, data from such programs may be useful, but that is not the typical situation
in which data are needed.

6. There are many national survey organizations that have, under contract, 
the capability to plan, carry out, and analyze such surveys. In addition, it is often 
possible to add questions to an existing national survey, possibly reducing costs.
It should be noted that, for surveys of a given sample size, national surveys are 
just slightly more expensive than local surveys. 

7. On the other hand, when the time comes to assess the extent of the prob- 
lem, there is usually no substitute for formal quantitative procedures. Stated a 
bit starkly, qualitative procedures are likely to be especially effective in deter- 
mining the nature of the need. Quantitative procedures are, however, essential 
to determine the extent of need. 

8. There are a number of other problems forecasters face. For example, sup- 
pose that a utility company wanted to forecast the demand for electricity 10 
years in the future. Because there is obviously a strong relationship between the 
number of residential, industrial, and agricultural customers and the demand 
for electricity, knowing the numbers of each kind of customer will provide a 
basis for instructive forecasts. However, those numbers would have to be fore- 
casted themselves, because the number of customers affects demand contem- 
poraneously. These and other problems are discussed in a broad social science 
context by Berk and Cooley (1987).

9. This conception of policy-driven research apparently causes considerable 
misunderstanding about the relationships between basic and applied social 
research. Policy-driven research tries to determine how changes in policy can 
affect the phenomenon in question. In contrast, knowledge about the phenome- 
non per se (the province of basic disciplinary concerns) may have no ready links 
to what can be done about it. For example, a study finding convincingly that vio- 
lent criminals often were abused as children does not by itself lead to rehabilitation
programs for violent criminals or to concrete interventions in the homes of
abused children. However, such a study might stimulate ideas for the kinds of 
policy-driven research necessary to develop sensible responses. That is, basic 
research may provide general clues about where and how to intervene. 

10. Randomization also means that the assumptions for routine significance 
tests are likely to be met.

11. Fairweather's efforts were not totally in vain. The basic understanding 
gained about what is needed to sustain chronically mentally ill patients outside 
institutions has made important contributions to the treatment of deinstitu- 
tionalized former patients. 


4 


Examining Ongoing Programs: 
A Chronological Perspective 


Once a program has been enacted and is functioning, one of the 
main questions is whether the program is functioning properly. 
Attention is not directed to whether the program is achieving its 
intended effects but to whether the program is operating day to 
day as expected. Often explicit is a comparison between the pro- 
gram as designed and the program as it is actually implemented. 
For example, even well-planned programs often have to be fine- 
tuned in the first few months of operation. (Indeed, estimates of 
effectiveness, therefore, should be made only when any neces- 
sary "shakedown period" is over.) 


Stage 1: 
Is the Program Reaching the 
Appropriate Beneficiaries? 


Achieving appropriate coverage of beneficiaries is often 
problematic. Sometimes a program is so poorly designed that it 
simply does not reach significant portions of the total intended 
beneficiary population. For example, an educational program
designed to reach intravenous drug users through community
institutions such as churches and schools may simply miss its
target population, which does not use the community institu- 
tions. A program to provide food subsidies to children who 
spend their days in child-care facilities may fail to reach a large 
proportion of such children if regulations exclude child-care 
facilities that are serving fewer than five children. A very large 
proportion of children who are cared for during the day outside 
their own households are cared for by women who take a few 
children into their homes (Abt Associates 1979). 

A thorough needs assessment of child-care problems would 
have revealed that such a large fraction of child care was furnished 
by small-scale vendors and, hence, should have been taken into 
account in drawing up administrative regulations. However, the 
needs assessment might not have been thorough enough. In addi- 
tion, patterns of the problem may change over time, sometimes 
in response to the existence of a program itself. For example, it 
is quite likely that the existence of shelters for battered women 
increases the demand for shelters. Among other things, shelters 
validate the option of leaving oppressive living arrangements. 
Another example concerns the labeling of consumer products. 
Labels printed in extremely small type or that use professional 
jargon may satisfy agency regulations; they may also be ignored 
by most consumers. The labeling program simply does not reach 
many of its intended beneficiaries. In short, it is important to 
review from time to time how many of the intended beneficiaries 
are in fact being covered by a program. 

Experience with social programs over the past two decades 
has shown that there are few, if any, programs that achieve full 
coverage or even near full coverage of intended beneficiaries, 
especially where coverage depends on actions that must be 
undertaken by prospective beneficiaries. Thus not all persons
who are eligible for Social Security payments actually apply for
them; estimates indicate that up to 15% of all eligible 
beneficiaries never apply. AFDC programs only reach about half 
of the families who are eligible. Some intended beneficiaries 
may not be reached because facilities for delivering the services 
are not accessible. A single job training program for all of the
State of Iowa that is located solely in Dubuque effectively does
not exist for individuals who live more than 50 miles away. 

There is also another side to the coverage problem. Programs 
may extend benefits to persons or organizations that were not 
intended beneficiaries. Such unwanted coverage may be impos- 
sible to avoid because of the ways in which the program is deliv- 
ered. For example, although Sesame Street was designed primar- 
ily to reach disadvantaged children, it was attractive to 
advantaged children and to many adults. There is no way to keep 
anyone from viewing a television program once broadcast (nor is 
it entirely desirable to do so in this case), and hence a successful
TV program designed to reach some specific group of children 
may reach many others as well (Cook et al. 1975). 

Although the unintended viewers of Sesame Street are 
reached at no additional cost to broadcasters, there are times 
when "unwanted" coverage may severely drain program resources.
For example, while Congress may have wished to provide
educational experiences to returning veterans through the
GI Bill and its successors, it was not clear whether Congress had 
in mind the subsidization of the many new proprietary educa- 
tional enterprises that came into being primarily to supply 
"vocational" education to eligible veterans. Or, in the case of the 
bilingual education program, many primarily English-speaking 
children were found to be program beneficiaries because some 
school systems discovered that the special bilingual classes
were an excellent place to tuck away their trouble-making 
English-speaking students. 

Studies designed to measure coverage are similar in principle 
to those discussed under needs assessment studies earlier. For 
example, a utility company might survey its customers to deter- 
mine who is taking advantage of an advertised rebate for install- 
ing better home insulation. Or a telephone company might review 
its own records to see how many of its customers are taking advan- 
tage of “lifeline” rates. Or a university might examine its admissions
records to determine if affirmative action programs are being
applied inappropriately to non-covered minority groups (e.g.,
Asian Americans). Perhaps the main difference between coverage
studies and needs assessments is that for the former there will
more likely be systematic records on which to build. That is, the 
existence of a functioning program often implies the existence 
of program records with useful information. 
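
Once program records and an estimate of the eligible population are in hand, the two sides of the coverage problem reduce to two simple proportions. The Python sketch below is illustrative only: the function name, the term "leakage" for unintended beneficiaries, and all of the counts are assumptions introduced here, not figures from any program discussed above.

```python
def coverage_summary(eligible_population, served_eligible, served_ineligible):
    """Summarize coverage from (hypothetical) program record counts."""
    if eligible_population <= 0:
        raise ValueError("eligible_population must be positive")
    served_total = served_eligible + served_ineligible
    # Share of intended beneficiaries who are actually reached.
    coverage_rate = served_eligible / eligible_population
    # Share of those served who were not intended beneficiaries.
    leakage_rate = served_ineligible / served_total if served_total else 0.0
    return {"coverage_rate": coverage_rate, "leakage_rate": leakage_rate}


# Hypothetical counts: 2,000 eligible households, 1,000 of them served,
# plus 250 served households that were never intended beneficiaries.
summary = coverage_summary(
    eligible_population=2000, served_eligible=1000, served_ineligible=250
)
print(summary)
```

With these made-up counts, half of the intended beneficiaries are reached and one in five of those served falls outside the target population. In practice the hard part is the denominator: the size of the eligible population usually has to come from a needs assessment or survey rather than from the program's own records.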


Stage 2: 
Is the Program Being 
Properly Delivered? 


Program Integrity Research 


It is far easier to describe a program than deliver it. Especially 
when program services depend heavily on the ability to recruit 
and train appropriate personnel, to retrain existing personnel, or
to undertake significant changes in standard operating procedures,
it is sometimes difficult to implement the intervention
as designed. And one cannot always rule out incompetence or 
outright corruption. But whatever the reason, a program that is 
not delivered as it was intended subverts the earlier develop- 
mental effort and spends money on false pretenses. 

Several examples may highlight the importance of program 
integrity. Although informational pamphlets on proper nutri- 
tion can be provided to medical personnel, pharmacies, and 
hospitals, the distribution of such literature to patients is 
always problematic. Properly motivating personnel to add the 
distribution of pamphlets to their existing duties is necessary 
but difficult to accomplish. And if the pamphlets are not deliv- 
ered, there is no program. Likewise, when an educational pro- 
gram on birth control requires that special equipment be used, 
as in the case of the distribution of video- and audiocassettes, 
delivery of the program can be made problematic. In some
cases, for instance, an assumption that schools have the requisite
equipment may be false.

In other cases, the anticipated services are delivered, but in 
diluted form. For example, a supplementary reading instruction 
program may be designed for an average of two instructional hours
per student per week. However, in practice, 30 minutes of the program
may be delivered on the average. The 75% reduction may
lower reading gains proportionally, in which case the program's
impact may be trivial. Or worse, the 75% reduction may drop the
program below a threshold at which any gains occur.
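
The arithmetic behind this example, and the difference between the two possibilities it raises, can be sketched in a few lines of Python. The planned and delivered minutes come from the example; the size of the full-dose gain, the threshold value, and the function names are hypothetical and serve only to contrast a proportional response with a threshold response.

```python
def dose_reduction(planned_minutes, delivered_minutes):
    """Fraction of the planned instructional time that was not delivered."""
    return 1.0 - delivered_minutes / planned_minutes


def proportional_gain(full_gain, planned_minutes, delivered_minutes):
    """Expected gain if gains shrink in proportion to delivered time."""
    return full_gain * delivered_minutes / planned_minutes


def threshold_gain(full_gain, planned_minutes, delivered_minutes, threshold_minutes):
    """Expected gain if no gain occurs below a minimum weekly dose."""
    if delivered_minutes < threshold_minutes:
        return 0.0
    return proportional_gain(full_gain, planned_minutes, delivered_minutes)


planned, delivered = 120, 30  # two hours planned, 30 minutes delivered
reduction = dose_reduction(planned, delivered)

# Hypothetical: the full program yields an 8-point gain, and at least
# 60 minutes per week are needed before any gain appears.
gain_if_proportional = proportional_gain(8.0, planned, delivered)
gain_if_threshold = threshold_gain(8.0, planned, delivered, threshold_minutes=60)
print(reduction, gain_if_proportional, gain_if_threshold)
```

Under the proportional assumption a quarter of the hypothetical gain survives; under the threshold assumption nothing does. Which response curve applies is an empirical matter, and it cannot be examined at all unless delivered dosage is recorded.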

Program integrity is often a particular problem in "loosely 
coupled” organizations in which the lines of authority are 
unclear or in which the lines of authority mean little in prac- 
tice. Academic departments in universities are an excellent 
example. The professional autonomy given to professors and the 
ideal of academic freedom mean that department chairs often 
have little control over what is taught in classrooms. Many human 
service organizations have similar problems: hospitals, police
departments, courts, welfare departments, and secondary schools. 
In all such organizations it is difficult to control what is occur- 
ring at the point of service delivery because of the discretion and 
autonomy given to service workers. To take another example, 
despite the threat of AIDS and other blood-transmitted diseases, 
it is often difficult to get emergency room nurses to always use 
surgical gloves when handling patients.

Evaluation research designed to measure what is being deliv- 
ered may be simple or complex. Thus it may be very easy to 
learn from hospitals how many persons are served each week in 
their various outpatient services, but very difficult to learn pre- 
cisely what transpires in the interactions between medical personnel
and patients. For example, if one is interested in the
kinds of information provided by physicians and nurses in out- 
patient care, one would have to undertake an in-depth observa- 
tional study that might well be very expensive to implement on 
a large scale. As another illustration, consider an evaluation of 
efforts to teach literacy as part of vocational training. One key 
question might be whether a particular pedagogical approach 
was being employed as promised. If only about six classes were
being studied, two full-time observers would probably be needed 
to do classroom observation. In addition, there is always the possibility
that the presence of observers may alter the behaviors of
teachers and students.

One of the best examples of systematic studies in difficult-to-observe
situations is Reiss's (1971) study of police-citizen encounters.
Research assistants were assigned to ride with police on
patrol and to systematically record each encounter between the 
police and members of the public. Reiss's study provides basic 
descriptive accounts of how such encounters are generated, how 
behavior of citizens affected police responses, and so on. 

A recent example of an excellent implementation study 
examines the mental hospitals that serve the Chicago metro- 
politan area (Lewis et al. 1987). The main problem was to 
describe how the legislation and rules for involuntary commit- 
ment to mental hospitals in place since the 1970s were working 
out in practice. The researchers discovered that fewer than 1% 
of the patients admitted over a year's time were involuntarily 
committed. Observing the court procedures, they found that 
many persons brought to the attention of the police because of 
their bizarre or aggressive behavior were offered the choice of 
voluntary commitment for up to 30 days or being involuntarily 
committed for 60 days or more. The courts and prosecutors 
offered these alternatives because involuntary commitment 
involved lengthy procedures that could appreciably reduce the 
number of cases the court could process. Given the choice, most 
persons brought in under complaint chose the more lenient
alternative. These practices averted what might potentially 
have been a very high burden on the courts and prosecutors. 

To fine-tune a program, it may not be necessary to collect data 
on a large scale. It may not matter, for instance, whether a particular
implementation problem occurs frequently or infrequently,
because it is not desirable for it to occur at all. Thus small-scale, 
qualitative observational studies may be most fruitful for pro- 
gram fine-tuning. For example, if qualitative interviews with 
welfare recipients reveal any instances in which husband-wife 
separations were undertaken solely for the purpose of retaining 
or increasing benefit eligibility, there might be sufficient evi- 
dence for revision of the program eligibility rule. 

Programs that depend heavily on particular personnel for
delivery and/or that involve complicated activities and/or that
call for individualized treatments for beneficiaries are especially
good candidates for careful and sensitive fine-tuning
research. Such programs imply that the unique characteristics 
of program personnel coupled with the unique characteristics of 
beneficiaries effectively determine what is delivered. Because it 
is impossible to standardize the program, it is difficult to con- 
trol what is delivered. Thus individualized human services are 
especially problematic. (See Fairweather and Tornatzky 1977 for 
an outstanding example.) 


Stage 3: 
Are the Funds Being 
Used Appropriately? 


Fiscal Accountability 


The accounting profession has been around considerably 
longer than has program evaluation; procedures for determining 
whether program funds have been used responsibly and as
intended are well established and, hence, are not problematic. 
However, assessments of fiscal accountability cannot substitute 
for the studies mentioned above. Proper use of funds does not 
necessarily imply that program services are being delivered as 
intended. Conventional accounting categories used in fiscal 
audits are ordinarily sufficient to detect fraudulent expenditure 
patterns, but they may be insufficiently sensitive to detect
whether services are being delivered to appropriate beneficiaries
at the recommended levels. As described earlier, for instance, just 
because salaries of emergency room social workers were paid as 
promised did not mean that the social workers were delivering
the promised services. Recall that the social workers were com- 
monly used on the wards instead of in the emergency room. In 
this light, it is instructive that the General Accounting Office has 
set up a separate section, called the Program Evaluation and Methodology
Division, one of whose major roles is to instruct GAO
personnel in appropriate evaluation procedures and to undertake 
evaluations of programs upon the request of Congress. 

It is also important to keep in mind that the definition of 
costs under accounting principles differs from the definition of
costs used by economists. For accountants, a cost reflects conventional
bookkeeping entries such as out-of-pocket expenses,
historical costs (i.e., what the purchase price of some item was), 
depreciation, and the like. Accountants focus on the value of 
current stocks of capital goods and inventories of products coupled
with "cash flow" concerns. When the question is whether
program funds are being appropriately spent, the accountant's
definition will suffice.

However, economists stress opportunity costs defined in
terms of what is given up when resources are allocated to particular
purposes. More specifically, opportunity costs reflect the
next best use to which the resources could be put. For example, 
the opportunity cost of raising teachers’ salaries by 10% may be 
the necessity of forgoing the purchase of a new set of textbooks. 
While opportunity costs may not be especially important from 
a cost-accounting point of view, they become critical when cost-effectiveness
or benefit-cost analyses of programs are undertaken.
We will have more to say about these issues later.

The three evaluation tasks just discussed are directed mainly 
to how well a program is functioning. Whether or not a program 
is effective is a different question, to which answers are not easily
provided. Essentially, one must determine whether or not a program
is achieving its goals over and above what would be expected
if the program did not exist. We turn to that enterprise now. 

Many evaluators consider the effectiveness question to be
quintessentially evaluation. We suspect that this derives in part
from the laboratory roots of many evaluation research tech- 
niques. In the laboratory, the treatment and control conditions 
are usually under the control of the researcher and, as a conse- 
quence, are not problematic. The researcher knows what was 
being delivered to whom. The "real" question, therefore, 
becomes whether the treatment had any impact. However, 
social programs are not launched in laboratories, and program
content is often the critical issue. Indeed, one could argue that,
unless there are ways to determine precisely what was delivered
to whom, program impact is irrelevant: What good is an effect
without knowing its cause?

Suppose, for example, that one wanted to evaluate efforts to
introduce literacy training into vocational training classes. Also 
suppose that vocational trainees are assigned at random to two
classrooms, one of which is to teach the usual vocational content 
and one of which is to integrate literacy and vocational training. 
Finally, suppose that, while there are absolutely no data on what 
went on in the two classrooms (either from observation, accounts 
from students, accounts from teachers or other sources), later
reading scores for the integrated curriculum are far higher than 
for the vocational curriculum alone. That is, there is convincing 
evidence of program impact. However, without knowing about 
treatment content, what could possibly be done with the results? 
It is impossible, for instance, to use these results to justify rou- 
tinizing the program, because no one but the students and teach- 
ers have any idea what the program is. Routinize what? 

This illustration conveys why, in our view, questions about how 
a program is functioning logically precede questions about pro- 
gram impact. An impact assessment is a waste of time unless the 
intervention is understood. Thus there is certainly no justifica- 
tion for interpreting every evaluation task in effectiveness terms, 
as some evaluators have done in the past, spurred by imprecise 
requests for help from policymakers and administrators. 

Once the treatment is well documented, however, the success 
or failure of that program is quite properly addressed. The prover- 
bial “bottom line” is always whether the program “works.” We 
turn, then, to ways in which program effectiveness may be empir- 
ically examined. 


Stage 4: 
Can Effectiveness Be Estimated? 


The Evaluability Question 


The effectiveness of a program that has gone through the 
stages described earlier in this chapter should, in principle, be
an answerable empirical question. Put a bit more cautiously, an 
impact assessment will not be precluded. But there are many 
human services programs that present problems for effectiveness
studies because one or more of the stages described earlier
was neglected or handled poorly. Perhaps most important, an 
impact assessment is impossible without well-formulated program
objectives. For example, a program designed to increase
learning among certain groups of schoolchildren through the 
provision of supplemental per capita payments to schools is not 
evaluable without further specification of goals. "Increase learning"
is hardly very specific. One would need to know such things
as what sort of "learning" was to be included and what a nontrivial
"increase" entailed.

Even biomedical experiments are not immune to vague
goals. Freedman and Zeisel (1988) describe testing the claim
that a certain chemical increases the risk of cancer,
assuming that this claim may be evaluated with a randomized
experiment using mice as subjects. A perplexing question
presented itself: How should they define the outcome variables?
Carcinogens are often rather specific in their impact; one may
be associated with cancer of the liver and another with cancer of
the lungs. For an experiment at hand, which cancers should be
counted? If, for example, all cancers are counted, an apparent
finding of "no effect" may be misleading. Small but important
effects for a particular kind of cancer may be lost in the "noise"
when all tumors are aggregated. In other words, the outcome
should have been stated in terms of the particular kinds of
cancers anticipated, not cancer in general.
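
A small numerical illustration may make the aggregation problem concrete. The tumor counts below are entirely hypothetical (they are not taken from Freedman and Zeisel), and the sketch assumes, for simplicity, equal group sizes and at most one tumor per animal so that counts can be added across sites.

```python
def risk_ratio(treated_cases, treated_n, control_cases, control_n):
    """Ratio of the tumor rate among treated animals to the rate among controls."""
    return (treated_cases / treated_n) / (control_cases / control_n)


# Hypothetical tumor counts per 100 mice in each group.
N_PER_GROUP = 100
tumors = {
    "liver": {"treated": 8, "control": 4},
    "all other sites": {"treated": 40, "control": 40},
}

liver_rr = risk_ratio(
    tumors["liver"]["treated"], N_PER_GROUP,
    tumors["liver"]["control"], N_PER_GROUP,
)

# Aggregate across sites (valid only under the one-tumor-per-animal assumption).
total_treated = sum(site["treated"] for site in tumors.values())
total_control = sum(site["control"] for site in tumors.values())
all_sites_rr = risk_ratio(total_treated, N_PER_GROUP, total_control, N_PER_GROUP)

print(f"liver only: {liver_rr:.2f}")
print(f"all sites:  {all_sites_rr:.2f}")
```

In this made-up case the chemical doubles the liver tumor rate, yet the all-sites ratio is only about 1.09, a difference that could easily be indistinguishable from chance in groups of this size.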

Clarifying goals can often be accomplished by helping pro- 
gram personnel to articulate them. This may mean several 
hours of conversations over a number of weeks. For example, a 
Bay Area program of workshops on domestic violence designed 
for judges had as its initial goal “making judges more sensitive to 
family violence cases." Did this mean changing sentencing patterns,
providing through counselors emotional support for victims,
reducing the number of continuances (which are very hard
on victims), or what? It took several meetings among the evaluator,
agency personnel, and the agency's advisory board before the
program's goals were properly clarified. Yet this step was absolutely
essential.

A second criterion for evaluability is that program content be
well specified. Thus a program "encouraging innovation" to 
make health education agencies more effective is not amenable 
to an impact assessment. In addition to vague goals, the means 
for reaching the goals are unclear. "Innovation" is not a method 
but a means of proceeding. And because anything new is an 
innovation, the health education program may encourage the 
temporary adoption of a wide variety of techniques likely to 
vary widely from site to site. In short, it must be clear what the 
intended intervention is. 

Third, a program's impact may be estimated only if it is possible
to credibly approximate what would have happened to the targeted
recipients in the absence of the program. (See our earlier discussion
of causality.) For example, randomized experiments are
a powerful means to make causal inferences about the impact of 
social programs, but, more generally, constructing comparison
groups of various kinds, whether by random assignment or not, 
is usually essential. Hence, a program that is universal in its cover- 
age and that has been going on for some period of time is very 
difficult (perhaps impossible) to evaluate for effectiveness. One 
cannot evaluate, therefore, the effectiveness of the public school 
systems in the United States because one cannot find American
cities, towns, counties, and states that do not have (or recently
have not had) public school systems.

To illustrate further the need for comparisons, a county in 
Northern California wanted an impact assessment of prosecu- 
torial efforts to increase the likelihood that serious drug offenders 
would be sanctioned severely and swiftly. One of the evaluation 
outcomes was citizens' fear of crime; presumably, swift and severe
sanctions would bring down the crime rate, at least for drug-related
offenses. Unfortunately, the evaluation was requested after
the program began, and no pretest of citizen attitudes was possible.
Without a pretest, it is simply impossible to tell whether the
program made any difference.
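
To see why the missing pretest matters, consider one common way of using pretest and comparison data together: subtract the change observed in a comparison site from the change observed in the program site. This particular calculation is not described in the text and is offered only as an assumed illustration; the function name and every number below are hypothetical.

```python
def pre_post_comparison_estimate(program_pre, program_post,
                                 comparison_pre, comparison_post):
    """Change in the program site minus change in the comparison site."""
    program_change = program_post - program_pre
    comparison_change = comparison_post - comparison_pre
    return program_change - comparison_change


# Hypothetical percentages of residents reporting fear of crime.
estimate = pre_post_comparison_estimate(
    program_pre=40.0, program_post=30.0,
    comparison_pre=42.0, comparison_post=38.0,
)
print(estimate)  # -6.0
```

Every input is required. With only the two posttest figures, a lower level of fear in the program county could reflect nothing more than a difference that existed before the program began.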

Finally, effectiveness evaluations are often the most difficult 
kinds of evaluations, requiring highly trained personnel and, 
sometimes, large sums of money. Thus it is silly to plan evaluations
of program impact unless there are sufficient resources and
unless appropriately trained professionals are available. Unfor- 
tunately, legislatures and administrators have often mistakenly 
required effectiveness evaluations from agencies that are not pre- 
pared to undertake them, often assuming as well that the costs 
would be modest (Raizen and Rossi 1981). For example, recent fed- 
eral legislation has required the National Institute of Justice to 
undertake an impact assessment of the large grants awarded to 
states and counties to "fight drugs." Yet, there was no accompanying
appropriation and, at least informally, there were unrealistically
high expectations about what could be learned.

There are no hard-and-fast rules about how much an effective- 
ness evaluation should cost or about how much skill may be 
needed. However, sometimes a useful starting point for discussions
of research costs is to ask that the equivalent of at least 1%
of the program's operating budget be available for program evalu- 
ation. For the requisite research skills, it is always helpful if the 
individuals who will be doing the evaluation have successfully 
done such research in the (recent) past; a track record is very 
important, far more important than formal credentials. 

Techniques have been developed (Wholey 1977) to determine 
whether a program is evaluable in the senses discussed above. 
Decision makers are well advised to commission such studies as 
a first step rather than to assume that all programs can be evalu- 
ated. Evaluability assessments essentially determine whether 
there are program goals that are sufficiently well articulated, 
whether the program is sufficiently clear and uniformly delivered, 
and whether the requisite resources are available. 

Finally, it may be worth mentioning that questions of evalua- 
bility have in the past been used to justify "goal-free" evaluation 
methods (e.g., Scriven 1972; Deutscher 1977). The goal-free advocates
have contended that, because many of a program's aims
evolve over time, the "hypothetico-deductive" approach to impact 
assessment (Heilman 1980) is at best incomplete and at worst mis- 
leading. In our view, impact assessment necessarily requires some 
set of program goals; and whether they are stated in advance and/or 
evolve over time does have important implications for one’s
research procedures (Chen and Rossi 1980). In particular, evolv-
ing goals require far more flexible research designs (and research- 
ers). In other words, there cannot be such a thing as a "goal-free" 
impact assessment. At the same time, we have stressed above that 
there are other important dimensions to the evaluation enterprise 
in which goals are far less central. For example, a sensitive 
monitoring of program activities can proceed productively with- 
out any consideration of ultimate goals. Thus goal-free evaluation 
approaches can be extremely useful as long as the questions they 
can address are clearly understood. 


Stage 5: 
Did the Program Work? 
The Effectiveness Question 


As discussed above, any assessment of whether or not a pro- 
gram "worked" necessarily assumes that it is known what the 
program was supposed to accomplish. For a variety of reasons, 
enabling legislation establishing programs often appears to set 
relatively vague objectives for the program, making it necessary
(as discussed above) to develop specific goals during the "design
phase." Goals for such general programs may be devised by pro-
gram administrators through consideration of social science
theory, past research, and/or studies of the problem that the pro-
gram is supposed to ameliorate.

In whatever way goals may be established, the important 
point is that it is not possible to determine whether a program 
worked without developing a limited and specific set of criteria
for establishing the condition of "having worked." Beyond clear 
goals, therefore, there needs to be a rather clear concept of “how 
good is good enough.” For example, it would not have been possi- 
ble to develop an assessment of whether Sesame Street “worked” 
without having decided that its goals were to foster reading and 
number-handling skills. But, once that was determined, there 
still remained the vital question of how large a gain in perfor- 
mance was to be called a success. A quarter of a grade level? A 
half of a grade level? Two grade levels?

In other words, without specificity about the size of the pro-
gram effect required, program evaluators are shooting at a mov-
ing target. Without such specificity, there will not be enough
ing target. Without such specificity, there will not be enough 
information about the effects being sought to properly inform a 
number of critical design decisions. For example, there will be 
no way to determine the necessary sample size because the 
appropriate sample size determination depends in part on the 
size of the effect one is trying to find; smaller anticipated effects 
require larger samples. Likewise, it will be very difficult to 
decide how to measure the outcome variable. Again, the 
amount of precision depends on the size of the effect being 
sought. Then, when the time comes for analysis, unnecessary 
uncertainty is compounded. For example, one may well lose the 
ability to do significance tests designed maximally to address 
whether the program worked as hoped (e.g., Goodman and 
Royall 1988). One could not, for instance, define the null 
hypothesis in terms of the amount of effectiveness required for 
the program to be called a success. Finally, the evaluation report 
will be subject to a large number of ad hoc interpretations 
because the definition of success will often be person-specific. 
One person's success may be another person's failure. That is, 
two individuals examining the same empirical results may 
legitimately draw contradictory conclusions.
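The link between the size of the effect sought and the sample size can be illustrated with a standard normal-approximation formula for comparing two group means. The sketch below is only an illustration: the function name, the 5% two-sided significance level, the 80% power, and the standard deviation of one grade level are all assumptions introduced here, not figures from any evaluation discussed in the text.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate number of subjects needed in EACH group to detect a
    difference in means of size `effect` (two-sided test, normal
    approximation, equal group sizes and equal standard deviations)."""
    if effect <= 0 or sd <= 0:
        raise ValueError("effect and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical gains in grade levels, assuming sd = 1 grade level.
    for gain in (0.25, 0.5, 2.0):
        print(gain, sample_size_per_group(gain, sd=1.0))
```

Under these assumptions, halving the effect one hopes to detect roughly quadruples the required sample, which is why "how good is good enough" has to be settled before the design is fixed.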

Assuming, however, that "success" is properly defined, one 
must still respond to the reality that programs never succeed or 
fail in absolute terms. Success or failure is always relative to 
some benchmark. Hence an answer to the question: "Did the
program work?" requires consideration of the question: "Com-
pared with what?"

The "compared to what?" question is by now an old friend, 
introduced most thoroughly when we earlier considered impact 
assessment of new programs. Recall that impact assessments 
for new programs were generally best undertaken with ran- 
domized field experiments. By and large, randomized designs 
are still the method of choice, although for ongoing social pro-
grams, random assignment faces a number of additional practi-
cal obstacles. For example, program recipients may feel entitled
to a service they have been receiving for some time. Consider,
for instance, the availability of unmetered water in a number of
rural communities. The switch to metered water would gener- 
ate a public outcry (and has in some locales), perhaps especially 
if coupled with random assignment. Access to water from local 
aquifers, rivers, streams, and lakes is, in many areas, part of the 
rights that historically have come with ownership of land. 

In short, it is time to briefly review alternatives to ran- 
domized experiments; we must allow for the possibility that 
comparisons to the intervention may involve non-randomly
constructed groups of various kinds. Note, in addition, that
while the alternatives are likely to be especially relevant to
impact assessments of ongoing programs, they may also be used 
(as a second choice) in impact assessments of new programs. 

The development of appropriate comparisons can proceed 
along at least three dimensions: (1) comparisons across different
subjects, (2) comparisons across different settings, and (3) com-
parisons across different times. In the first instance, one might
compare different sets of persons, trying to hold constant the
setting and when the study is undertaken. In the second
instance, one might compare the performance of the same set of
persons in different settings, such as at home and at work
(necessarily at two different points in time). In the third
instance, one might compare the same students in the same set-
ting but at different points in time.

Consider as an example the different levels of aggregation
involved in school settings (individual students, classes, and
schools) and the time structuring of schooling (class periods,
terms, and academic years). As Table 4.1 indicates, it is possi-
ble to mix these three fundamental dimensions to develop a
wide variety of comparison groups. For example, one comparison
group varies both the subjects and the setting although the
time is the same. Another comparison group varies the subjects,
the setting, and the time. However, with each added dimension
by which one or more comparison groups differ from the
experimental group, the number of threats to the validity of the
resulting effectiveness estimates necessarily increases. For
example, the use of a comparison group from a different setting
and a different time period requires that assessment of program
impact simultaneously take into account possible confounding
factors associated with such things as changes in student back-
ground and motivation and such things as the "reactive" poten-
tial of different classroom environments.

As an illustration of the difficulties that often follow in the 
absence of random assignment, consider the evaluation (Robert-
son 1980) of the effectiveness of high school driver education
programs in which the goal was to reduce automobile accidents 
among 16- to 18-year-olds. Despite sympathy for the programs, 
the state legislature decided not to provide any funding. In 
response, some school districts dropped driver education from 
their high school curriculum and some retained it. Two sets of 
comparisons were possible: (1) accident rates for persons of the 
appropriate age range in the districts that dropped the program 
computed before and after the program was dropped, and (2) 
accident rates for the same age groups in the districts that 
retained driver education compared with the accident rates in
districts that dropped the driver education program. 

It was found that the accident rates were significantly lower
in those districts that dropped the program, a finding that might
lead one to believe that the program increased the risk of acci-
dents, perhaps because young people were enticed to obtain
licenses earlier. However, internal validity in this instance 
depends on considerable knowledge about the process by which 
some school boards dropped the program. In most cases, school
boards apparently dropped the program because of financial con-
siderations. If (and only if) one can accept that local financial
concerns were unrelated to the number of automobile accidents
in the past, then (the unfortunate) inferences about the impact
of the driver education program may be taken seriously.5


Table 4.1 A Typology of Comparison Groups


                      Same Subjects           Different Subjects
                   Same       Different       Same       Different
                   Setting    Setting         Setting    Setting

Same Time          X(a)       X(b)            C1         C2
Different Time     C3         C4              C5         C6


a. No comparison is possible.
b. Although logically possible, it is not sensible for human subjects.


Of course, randomization will, on the average, eliminate con-
founding influences in the estimation of impact. On grounds of
analytic simplicity alone, it is easy to see, therefore, why so
many practitioners of impact assessment strongly favor research
designs based on random assignment. As noted earlier, however,
random assignment is often impractical or even impossible.
And even where random assignment is feasible, its advantages
depend on randomly assigning a relatively large number of subjects.
To randomly assign only two schools to the experimental group
and only two schools to the control group, for example, will not
allow the average equivalence between experimentals and con-
trols to materialize. Consequently, one is often forced to
attempt statistical adjustments for initial differences between
experimental and comparison subjects. Whether or not such
adjustments succeed is always questionable.

What about using statistical controls? Unfortunately, appro-
priate statistical adjustments (in the absence of randomization)
through multivariate statistical techniques require a number of
assumptions that are almost impossible to meet fully in prac-
tice. For example, it is essential that measures of all confound-
ing influences be included in a formal model of the program's
impact, that their mathematical relationship to the outcome be
properly specified (e.g., a linear additive form versus a mul-
tiplicative form), and that the confounding influences be mea-
sured without error. Should any of these requirements be vio-
lated, one risks serious bias in any estimates of program impact.

While we will have a bit more to say about multivariate 
statistical adjustments later, suffice it to say now that there is a
growing consensus among statisticians that social scientists of
various stripes have routinely pushed statistical procedures well
beyond where they are designed to go. (See, for example, the
Summer 1987 issue of The Journal of Educational Statistics.)
Statistical procedures have far too often been applied to data
that are not even remotely appropriate, relying on assumptions
that have virtually no justification. Coming under particular
criticism is the use of "structural equation models" (especially
with "latent" variables), which regularly outstrip social sci-
ence data and theory. At this juncture, perhaps the best advice is
that fancy statistics is no substitute for random assignment, and
statistical analyses should be simple and as close to the data as
possible. For example, multivariate matching, when feasible,
may be superior to statistical adjustments (often based on tech-
niques such as multiple regression) because matching assumes
no functional form between the explanatory/control variables
and the outcome (Rosenbaum and Rubin 1985).
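To make the contrast with regression-style adjustment concrete, here is a minimal sketch of matching on a single control variable. Nearest-neighbor matching without replacement is used purely for illustration; it is an assumption of this sketch, not necessarily the procedure of the cited authors, and the function name and data layout are hypothetical.

```python
def matched_difference(experimentals, comparisons):
    """Mean outcome difference across matched pairs.

    Each argument is a list of (control_variable, outcome) pairs.
    Every experimental unit is paired with the not-yet-used comparison
    unit whose control-variable value is closest. No functional form
    linking the control variable to the outcome is assumed.
    """
    if not experimentals:
        raise ValueError("need at least one experimental unit")
    if len(comparisons) < len(experimentals):
        raise ValueError("need at least as many comparison units")
    pool = list(comparisons)  # copy, so the caller's list is untouched
    diffs = []
    for x_e, y_e in experimentals:
        # Index of the closest remaining comparison unit.
        best = min(range(len(pool)), key=lambda i: abs(pool[i][0] - x_e))
        _, y_c = pool.pop(best)
        diffs.append(y_e - y_c)
    return sum(diffs) / len(diffs)
```

The quality of such a comparison still depends entirely on whether the variables matched on are the ones that actually drove selection into the program.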

It is sometimes possible either to solve or to partially bypass 
comparison group problems by resorting to some set of external 
criteria as a baseline. For example, it is common in studies of 
desegregation or affirmative action programs to apply various 
measures of equity as a "comparison group" (Baldus and Cole 
1977). Thus an assessment of whether schools in Black neigh- 
borhoods are being funded at comparable levels to schools in 
White neighborhoods might apply the criterion that disparities 
in excess of plus or minus 5% in expenditures per pupil indicate 
inequality (Berk and Hartman 1972). However, the use of such 
external baselines by themselves still leaves open the question 
of causal inference. It may be difficult to determine if the pro-
gram or some other set of factors produced the observed rela-
tionship between outcomes of interest and the external metric.
For example, the lower funding of schools in Black neighbor- 
hoods may stem from discriminatory policies of the school 
board or the greater seniority, and, therefore, higher salaries, of 
teachers working in the schools of White neighborhoods. 


Some Research Designs for 
Estimating Effectiveness 


The discussion of comparison group strategies in the last few
pages has necessarily been couched in relatively abstract terms.
The actual practice of choosing among such strategies leads to 
a large variety of research designs. A typology of research design 
types commonly used for assessing the effectiveness of pro- 
grams is shown in Table 4.2. 

There are two dimensions to the typology: (1) what is known 
about the mechanism by which some units (e.g., people) were
exposed to the program and some units were exposed to the con- 
trol condition, and (2) how a causal effect may be operationalized. 
Regarding knowledge of the assignment mechanism (middle 
column in Table 4.2), there are three possible situations. First, 
the mechanism may be known, but the result (i.e., assignment to
the experimental or control condition) cannot be known in 
advance. That is, the mechanism is stochastic. The equivalent of 
a coin flip is an illustration. Second, the mechanism may be 
known, and it is possible to know the result in advance as well. 
That is, the mechanism is deterministic. Assigning solely on the 
basis of some observable characteristic such as income is an 
illustration; individuals with incomes below some threshold
may be given income subsidies and individuals with income
above some threshold may not be given income subsidies.
Third, the assignment may be unknown but hypothesized. For
example, there may be a number of factors determining which
households adopt recycling practices and which do not. It may
be impossible to know exactly what those factors are, but it is
certainly possible to develop informed hypotheses.


Table 4.2 A Typology of Research Designs


Research                     Assignment       Treatment
Design Type                  Mechanism        Effects

I.   Randomized ("true")     Known-           Y_E - Y_C
     experiments             stochastic
II.  Regression              Known-           (Y_E | A) - (Y_C | A)
     discontinuity           deterministic
III. Interrupted             Unknown-         (Y_t2 | T, V) - (Y_t1 | T, V)
     time series             hypothesized
IV.  Cross-section           Unknown-         (Y_E | X) - (Y_C | X)
                             hypothesized
V.   Pooled cross-section    Unknown-         (Y_E,t2 | T, V, X) -
     time series (panel)     hypothesized     (Y_C,t2 | T, V, X)


NOTE: Y = outcome; E = experimental group; C = comparison group;
A = assignment variable; t1 = before the intervention; t2 = after the
intervention; X = confounding variables (covariates); T = trends; and
V = events.

Under the operationalization of causal effects, there are a
number of different possibilities. The most important differ-
ences, however, depend on whether the causal effect is defined
in terms of cross-sectional comparisons or longitudinal compar-
isons and on what additional information may be taken into
account to make the comparison “fair.” For example, we will see
that for randomized experiments, the usual comparisons are
cross-sectional, and fair comparisons require nothing more than
knowing what intervention was received by each unit. Other
designs are more complicated to analyze. In any case, for each
design there are several ways to define a treatment effect (e.g.,
a difference or a ratio). Those listed in the last column of Table
4.2 are among the most common and will suffice for exposi-
tional purposes.


Design Type I: 
Randomized (“True”) Experiments


The uncertainty creates no problems, however, because, in effect,
assignment results from a fair lottery, which on the average makes 
the units assigned to the experimental group the same as the units 
assigned to the control group. That is, all external confounding 
influences are eliminated; the groups are on the average compara- 
ble before the intervention is introduced. 
Randomized experiments have a number of other assets (Berk et al.).
Perhaps most important, appropriate estimates of treat-
ment impact can be obtained from a simple comparison
between the "average" outcome for the experimentals and the
"average" outcome for the controls. The "simple" comparison
may be a difference (as shown in Table 4.2) or a ratio. And the
"average" may be a mean, median, or any other sensible measure
of central tendency. One might in an experiment on the impact
of a job training program, for instance, use the difference in
postintervention median income between the experimentals
and controls. Another important advantage of randomized
experiments is that, typically, the assumptions necessary for
statistical inferences are likely to be met.
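The two steps just described, a fair lottery followed by a simple comparison of "averages," can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the lottery splits the units into two equal halves, and the income figures are invented for the example.

```python
import random
import statistics


def assign_at_random(units, rng):
    """Split units into experimental and control groups by fair lottery."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def treatment_effect(y_exp, y_ctl, average=statistics.mean, as_ratio=False):
    """Compare the 'average' outcome of experimentals and controls.

    `average` may be any measure of central tendency (mean, median, ...);
    the comparison is a difference by default, or a ratio if requested.
    """
    a_e, a_c = average(y_exp), average(y_ctl)
    return a_e / a_c if as_ratio else a_e - a_c


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the lottery can be reproduced
    experimentals, controls = assign_at_random(range(1, 21), rng)
    print(experimentals, controls)
    # Invented postintervention incomes for a job training experiment.
    income_exp = [11800, 12500, 14100, 15300, 19000]
    income_ctl = [10900, 11700, 12400, 13800, 16200]
    print(treatment_effect(income_exp, income_ctl, statistics.median))
```

Nothing beyond group membership enters the estimate, which is the analytic simplicity the text attributes to randomized designs.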

This is not to say that randomized experiments necessarily lead 
to straightforward results. Everything depends on the random 
assignment being implemented as designed. If, for instance, 
agency personnel override the random assignment, even with the 
best of intentions, the unique assets of true experiments are at 
least debased (Berk and Sherman 1988). For example, in an exper- 
iment undertaken in Detroit on the deterrent impact of arresting 
shoplifters, individuals apprehended for shoplifting by depart-
ment store security personnel were to be assigned at random to
one of two conditions: (1) arrest and (2) reprimand and release.
However, the assignment pattern was initially alternating: odd
cases received arrest and even cases received reprimand and
release. As a result, store personnel were often able to anticipate
the assignment outcome and some used this information to pair
particular individuals with particular treatments. With perhaps
the best of intentions, they were trying to make sure that accused
shoplifters got "what they deserved."
Still, even if units are actually assigned at random, two addi-
tional assumptions must be made (Rubin 1986). First, the sub-
jects must not be affected by the assignment mechanism itself.
For example, suppose that, in a job training experiment, individ-
uals who are randomly assigned to the control condition (e.g.,
job referral assistance only) misinterpret the assignment as an
assessment of their potential. That is, they believe, incorrectly,
that they were less deserving than individuals assigned to the
experimental group. Then, a resulting reduction in self-confidence
may translate into poorer performance in the job market.
Second, one must assume that the intervention received by the
experimentals has no impact on the controls, and vice versa. For
example, suppose one wanted to test the impact of teaching
mathematics in a new way to primary school students. If teach-
ers in the control group are threatened by the new technology,
they may just work harder within their conventional curricu-
lum to "show" administrators that the new approach is unneces-
sary. This is sometimes called a "John Henry Effect" and clearly
affects treatment content.


True experiments: caveat emptor.


Design Type II:
Regression Discontinuity 
(Assignment by Observed Variables) 


Some programs are administered using a clear set of rules for 
selecting participants. For example, some college fellowship 
programs allocate fellowships on the basis of scores received on
standardized tests (e.g., the National Merit Scholarship Test). In
a similar fashion, eligibility for food stamps is determined by
income. Likewise, access to privileges in prison is often decided
by the number of disciplinary infractions. Note that in all three
illustrations there is, at least in principle, a threshold that
cleanly and definitively determines whether benefits are
provided. Individuals above (or below) the threshold receive
benefits, while individuals below (or above) the threshold do
not. That is, there is no uncertainty in the assignment process.

If such administrative rules are followed faithfully, it is possi- 
ble to obtain fair (unbiased) estimates of treatment effect if, in 
addition to the assumptions required for randomized experi- 
ments, one additional assumption is met. One must assume 
some functional form for the relationship between the variable
used to determine who gets support (e.g., test scores) and the
outcome (e.g., grade point average in college). A linear form or
simple polynomial is commonly used.

The reason for the additional assumption is easily under-
stood. Recall that, in the case of random assignment, the
experimentals and controls were on the average comparable.
When assignment is determined by some threshold on an
observed variable such as a test score, however, there is good rea-
son to suspect that the experimentals and controls are not com-
parable (and hence "control" group is not really appropriate).
Other things being equal, for instance, students with higher test
scores may well perform better in college.

The solution is to use information about the relationship
between the assignment variable and the outcome to infer how
the comparison group would have performed if their values on
the assignment variable were on the average the same as those
of the experimental group. If, for example, the relationship
between test scores and later grade point average is linear, one
can easily extrapolate what the grade point averages of the com-
parison group would have been had they had test scores identi-
cal on the average to the experimental group. Then, these
extrapolated values may be compared to the observed grade
point averages of the experimentals. For example, one can
compute the simple difference between the two. However,
if the functional form is wrong, the extrapolations will be
wrong, and, as a result, the comparison will be misleading. In
practice, there is often evidence in the data that may make one
functional form more plausible than others. It cannot be over-
emphasized, however, that there will be no experimental and
comparison group members with the same values on the assign-
ment variable. Consequently, there are no means to verify
empirically that the assumed functional form is correct. Evi-
dence for a particular extrapolation is a long way from proof that
the extrapolation is accurate.

The argument just made is simply summarized in Table 4.2
in the column “Treatment Effects.” If one knows the assignment
variable (or combination of variables) and how it was used (i.e.,
the threshold), one may obtain unbiased estimates of the treat-
ment effect after controlling (via statistical procedures such as
analysis of covariance) for the assignment variable. That is, the
treatment effect is conditional upon values of the assignment
variable ("A"). It is in this process of making statistical adjust-
ments that a functional form must be assumed and one is essen-
tially looking for a discontinuity (or "jump") in the estimated
regression line based on that functional form. Hence the term
"regression discontinuity design."
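A minimal sketch of this adjustment follows, assuming the simplest case: a linear functional form with one common slope on both sides of the threshold, which is the analysis-of-covariance estimate of the "jump." The function name, the convention that units at or above the threshold are the experimentals, and the data in the usage note are assumptions made for illustration.

```python
def rd_effect(assign_values, outcomes, threshold):
    """Estimated jump in the outcome at the threshold.

    Units with an assignment value >= threshold are treated as the
    experimentals. A single linear slope relating the assignment
    variable to the outcome is assumed for both groups; if that
    functional form is wrong, the estimate will be misleading.
    """
    pairs = list(zip(assign_values, outcomes))
    treated = [(a, y) for a, y in pairs if a >= threshold]
    control = [(a, y) for a, y in pairs if a < threshold]
    if len(treated) < 2 or len(control) < 2:
        raise ValueError("need at least two units on each side of the threshold")

    def mean(values):
        return sum(values) / len(values)

    a_e, y_e = mean([a for a, _ in treated]), mean([y for _, y in treated])
    a_c, y_c = mean([a for a, _ in control]), mean([y for _, y in control])

    # Pooled within-group slope of the outcome on the assignment variable.
    sxy = sum((a - a_e) * (y - y_e) for a, y in treated)
    sxy += sum((a - a_c) * (y - y_c) for a, y in control)
    sxx = sum((a - a_e) ** 2 for a, _ in treated)
    sxx += sum((a - a_c) ** 2 for a, _ in control)
    if sxx == 0:
        raise ValueError("assignment variable does not vary within groups")
    slope = sxy / sxx

    # Raw difference in means, minus the part explained by the groups'
    # different average positions on the assignment variable.
    return (y_e - y_c) - slope * (a_e - a_c)
```

For instance, with hypothetical test scores 0 through 9, a threshold of 5, and grade point averages generated as 1 + 0.5 × score plus 2 for those at or above the threshold, the function recovers a jump of about 2 even though the raw difference in group means is 4.5.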


The regression discontinuity design is, for some, counter 


apparently, that after controlling 
(and only the assignment variable)
known exactly for each unit. In an analogous manner to random
assignment, controlling for the assignment variable, therefore,
severs all relationships between variables related to the out-
come and the intervention assigned. Unbiased estimates follow.

Regression discontinuity designs are particularly useful
when programs assign benefits on the basis of some measured
attribute. With little effort, a powerful quasi-experimental
design is already in place (e.g., Berk and Rauma 1983). In addi-
tion, regression discontinuity designs are sometimes useful
alternatives to randomized experiments when random assignment is
ethically or politically unacceptable. One may assign on the
basis of "need" or any other attribute as long as there is an
observable variable on which a threshold may be placed (for
more details see Trochim 1984).


Design Type III:
Interrupted Time Series


Interrupted time series designs are based on repeated mea-
sures, over time, of some outcome. Simply put, the idea is to
compare the time trend before an intervention with the time
trend after. For example, a downward trend in the conviction rate
for a particular jurisdiction may be reversed after more prosecu-
tors are hired. Or air pollution levels downwind from a major
power plant may be relatively stable over several years until a
dramatic drop materializes following the introduction of
cleaner-burning fuels.

Time series analyses are especially important for estimating 
the net impacts of full coverage programs. Under full coverage, 
all the units that could be served are being served. Conse- 
quently, there are no reasonable comparison groups. However, if 
a relatively large number of observations are collected before 
and after the intervention, the earlier period provides a compari- 
son "group" for the later period. Thus it may be possible to study 
the effect of the enactment of a gun control law in a particular 
jurisdiction, but only if the evaluator has access to a sufficiently
lengthy series of crime statistics for gun-related offenses, both
before and after the law was enacted. Or the effects of changing
pricing policies on residential water consumption can be stud-
ied by analyzing the consumption trends, if consumption data
can be found before and after the pricing policy changes (Berk et
al. 1981). Of course, for many interventions such long-term mea-
sures do not exist. For example, there are no long-term, detailed
time series on the incidence of certain acute diseases, making it
difficult to assess the impact on those diseases of Medicare or
Medicaid.

The basic logic underlying the analysis of interrupted time
series designs is quite simple. The time series before the inter-
vention is analyzed so that temporal trends (or “patterns” more
generally) before the intervention are characterized as accu-
rately as possible. For example, the number of burglaries may be
increasing at an increasing rate. Then, the preintervention
trends are used to project what would have happened without
the intervention. Finally, the observed trends after the interven-
tion are compared with the projections. In its simplest form
the preintervention mean is compared to the postintervention
mean, as shown in Table 4.2 (where "T" stands for trends and "V"
stands for events).
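The three steps (characterize the preintervention trend, project it forward, compare with what was observed) can be sketched as follows. The sketch assumes a straight-line trend and equally spaced observations and ignores confounding events and serial correlation, so it illustrates the logic only; the function names and the numbers in the usage note are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least squares intercept and slope for a single series."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


def projected_effect(pre, post):
    """Mean gap between observed postintervention values and the
    values projected from a linear preintervention trend.

    `pre` and `post` are outcome series at consecutive, equally
    spaced time points (assumed here for simplicity).
    """
    if len(pre) < 2 or not post:
        raise ValueError("need at least two pre and one post observation")
    intercept, slope = fit_line(list(range(len(pre))), pre)
    start = len(pre)
    projected = [intercept + slope * t for t in range(start, start + len(post))]
    gaps = [obs - proj for obs, proj in zip(post, projected)]
    return sum(gaps) / len(gaps)


def mean_shift(pre, post):
    """Simplest form: postintervention mean minus preintervention mean."""
    return sum(post) / len(post) - sum(pre) / len(pre)
```

With an invented series that rises 10, 12, 14, 16 before the intervention and reads 13, 15 afterward, the simple comparison of means suggests an increase of 1, while the comparison against the projected trend suggests a drop of 5. The difference shows why the preintervention trend has to be characterized before any effect is claimed.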

While capturing the preintervention trends is a necessary 
condition for accurate estimates of treatment impact, it is not
sufficient. One must also take into account events, in addition 
to the intervention, that are related to when the intervention 
was introduced and could affect postintervention trends. For 
example, a reduction in water consumption after an increase in 
the marginal price might be obscured by an overall increase in 
water consumption because of a major leak in the water distri-
bution system. Or an apparent reduction in water consumption
after an increase in the marginal price may really result from the 
installation of new water-saving irrigation technology that was 
he price increase was even contem- 


must be addressed. 


Put in the terms we used above, the "assignment" process is
one of timing. All units receive the intervention because there
is only one unit: a company, a household, a city, a school dis-
trict, or even a single individual (Kazdin 1982). What needs to be
explained is not which units receive the intervention and which
do not, but when the intervention is introduced. It is this assign-
ment process that must be considered, and all variables related
to the timing of the intervention, that also may affect the
outcome, must be taken into account. We show
this in Table 4.2 by including as "conditioning variables" not
only trends (“T”) but confounding events (“V”). In practice,
the Ts and Vs are “taken into account” with multivariate statistical
techniques that are beyond the scope of this book.

The most serious limitation on time series designs is the
need to adjust, in the statistical analysis, for pre-
existing trends ("Ts") and events that are roughly contemporane-
ous with the intervention ("Vs"). These trends and events cannot be known
with the confidence that the assignment mechanism can
be known for either the randomized experiment or the regres-
sion discontinuity design. They must be hypothesized,
drawing on social science theory, past research, and informa-
tion from the data on hand. And there is ultimately no way to
fully test whether the Ts and Vs taken into account are the
only Ts and Vs that should have been taken into account.
Put another way, in addition to all of the assumptions required
for randomized experiments, one must accurately hypothesize
what Ts and Vs are relevant and, typically, the functional form of
their relationships with the outcome.

Another obstacle is that the number of preintervention and 
postintervention observations must be sufficient to reveal
accurately preintervention and postintervention time trends
(more than 25 observations for each are sometimes recom-
mended). For this reason, interrupted time series designs are
often restricted to outcomes for which governmental or other
groups routinely collect and publish the needed statistics.

Despite these and other drawbacks, however, the interrupted
time series design can be very effective when more powerful
designs are not available.


Design Type IV:
Cross-Sectional Designs 


Whereas interrupted time series designs were characterized 
by temporal variation only, cross-sectional designs are charac- 
terized by cross-sectional variation only. One is simply examin- 
ing whether two or more sets of units differ at some specific 
moment in time. One set, for example, might be cities that 
earlier passed rent control ordinances and another set might be 
cities that did not. Then the central empirical comparison 
might be between the current median vacancy rate for similar 
kinds of apartments in the two kinds of cities: the median 
vacancy rate for rent controlled cities versus the median 
vacancy rate for non-rent-controlled cities. That is, the outcome is
measured at only one moment in time and only as a posttest. 
Hence, comparisons can only be between units at that historical 
moment. 

Much as in randomized experiments and regression discon-
tinuity designs, one is interested in a comparison of units
exposed to the intervention with units not exposed to the inter-
vention. However, unlike true experiments and regression dis-
continuity designs, the assignment mechanism must be
hypothesized. And if the hypothesized assignment mechanism
is not effectively the same as the true assignment mechanism,
the comparisons between the exposed and unexposed units will
lead to misleading treatment effect estimates.

Stated a bit differently, the problem is that members of the 
exposed and unexposed groups are not likely to be on the average 
comparable before the intervention is introduced. Insofar as
these differences are also related to the outcome measure, the 


conditional on "X." 


Consider, for example, an ongoing program in the San
Francisco area to make judges more sensitive to the special nature of
domestic violence cases. Judges are offered weekend workshops
in which they learn about the nature of domestic violence and
the special needs of both victims and offenders. As a result of
these workshops, sentencing patterns are supposed to change.
Now, suppose that the potential change in sentencing patterns
will be estimated by comparing the sentences given by judges
who participate in the workshops with the sentences given by
judges who do not (i.e., a cross-sectional comparison).

For expositional purposes, assume that judges were simply
encouraged to volunteer for the weekend workshops. Clearly,
not all judges in the state would participate, and those who did
might well differ from those who did not. For example, judges
more concerned about the issues might well be the ones more
likely to volunteer. And these judges might already sentence
differently. Alternatively, they might be individuals who would
have changed their sentencing practices anyway in the near
future. How then can a fair comparison be made?

For simplicity, suppose that “concern about domestic violence”
was the only factor affecting the likelihood that judges would
volunteer for the workshops. Clearly, if for each experimental
group judge a comparison group judge could be found who was
equally “concerned,” the two groups would be matched on the
“selection” variable on which participation depended. The two
groups would then be comparable person by person. For reasons
briefly mentioned earlier, exact matching is often impractical and,
as an alternative, statistical adjustments are often undertaken that
equate the experimental and comparison groups on the average.
That is, when the central tendencies (usually means) of the two
groups are equated on factors that differ before the intervention is
introduced, the two groups are made comparable on the average.
In this illustration, the group of judges who did not volunteer
would have the same average level of “concern about domestic
violence” as the group of judges who did. With these adjustments
made, the groups can be fairly compared.

In practice, the necessary adjustments are not easily made. In
addition to all of the assumptions required in randomized
experiments, one must assume that all of the variables affecting
assignment and the outcome are known and measured. These
variables should be chosen by developing a “selection model” for how
units are assigned to the experimental and comparison condi- 
tions. But because such a model is only hypothesized, internal 
validity is always in jeopardy. That is, there is no direct way to 
empirically validate the hypothesized selection model. 
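
The logic of equating groups on a selection variable can be made concrete with a small sketch. Everything in it is hypothetical: fifteen invented judges, a three-level "concern" score standing in for the selection variable, and sentence lengths constructed so that the workshop effect is exactly 3 months. The function names are ours, and stratifying on the selection variable is only one of several adjustments an evaluator might use.

```python
from statistics import mean

TRUE_EFFECT = 3  # months added by the workshop (assumed purely for illustration)


def make_judges():
    """Build hypothetical records: (concern, attended, sentence_months)."""
    # concern level -> (number of non-volunteers, number of volunteers);
    # more concerned judges are made more likely to volunteer.
    counts = {1: (4, 1), 2: (3, 2), 3: (1, 4)}
    judges = []
    for concern, (n_no, n_yes) in counts.items():
        baseline = 10 + 2 * concern  # concerned judges already sentence more heavily
        judges += [(concern, 0, baseline)] * n_no
        judges += [(concern, 1, baseline + TRUE_EFFECT)] * n_yes
    return judges


def naive_difference(judges):
    """Unadjusted cross-sectional comparison of mean sentences."""
    treated = [y for _, t, y in judges if t == 1]
    control = [y for _, t, y in judges if t == 0]
    return mean(treated) - mean(control)


def stratified_difference(judges):
    """Compare judges only within the same concern level, then average the
    within-level differences, weighting by the number of volunteers."""
    weighted_sum = 0.0
    weight_total = 0
    for level in sorted({x for x, _, _ in judges}):
        treated = [y for x, t, y in judges if x == level and t == 1]
        control = [y for x, t, y in judges if x == level and t == 0]
        if treated and control:  # levels lacking either group are skipped
            weighted_sum += len(treated) * (mean(treated) - mean(control))
            weight_total += len(treated)
    return weighted_sum / weight_total


judges = make_judges()
print(round(naive_difference(judges), 2))       # inflated by self-selection
print(round(stratified_difference(judges), 2))  # recovers the constructed effect
```

The adjustment works here only because "concern" is, by construction, the sole variable driving both volunteering and sentencing. With real data that is precisely the assumption that cannot be checked directly.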


Design Type V: 
Pooled Cross-Sectional and 
Time Series Designs (Panels) 


Randomized experiments and regression discontinuity 
designs may rely on cross-sectional (across units) and time 
series (over time) information. Cross-sectional comparisons are
made between individuals who are exposed to the intervention
and those who are not. In addition, statistical power may be
improved by including measures of the outcome variable before
the intervention. That is, while preintervention measures of the
outcome variable are often not needed for unbiased treatment
effect estimates, “pretest” measures allow one to more easily
separate “real” treatment effects from “noise.” Thus true
experiments and regression discontinuity designs may capital- 
ize on longitudinal comparisons. 

Under pooled cross-sectional and time series designs (also 
known as “panel” designs), both cross-sectional and longitudinal 
comparisons may be made. However, we only include those sit-

the outcome measure is the "T-cell count"; the lower the count,
the more compromised the immune system. The drug is the inter- 
vention. For those who take the new drug, the prescription's date 
is recorded. In addition, information is collected on a number of 
physical and behavioral variables potentially related to both the 
taking of the drug and the conversion to AIDS: other drugs being 
taken, other illnesses, sexual behavior, the use of "recreational 
drugs," diet, exercise, and the like. For example, an infected indi- 
vidual who feels he is eating well and getting sufficient exercise 
may be less likely to see a physician and also less likely to "con-
vert." Or an infected individual who engages in "high-risk
behavior," such as intravenous drug use, may be more likely to
see a physician (as a precaution) and more likely to "convert."
The analysis of the drug data may be initially seen as a set of
interrupted time series analyses, one for each individual who
took the new drug. As before, one can compare time trends
before taking the drug with time trends after, and potentially
confounding events, such as contracting another illness, would
need to be taken into account. Just as in an interrupted time
series design, the simplest analysis would contrast the mean (or
median) T-cell count before the intervention with the mean (or
median) T-cell count after the intervention.
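
To see why a preexisting trend matters for the simple before-and-after contrast, consider the following minimal sketch for a single individual. The series is invented: a count that drifts down by 5 per period and is raised by 40 once the drug is taken. The straight-line fit to the preintervention points is our illustrative choice of trend model, not a prescription.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


def make_series(n_pre=6, n_post=6, start=400, trend=-5, effect=40):
    """Hypothetical T-cell counts; the drug is taken from period n_pre on."""
    return [start + trend * t + (effect if t >= n_pre else 0)
            for t in range(n_pre + n_post)]


def naive_shift(series, n_pre):
    """Mean after the intervention minus mean before it."""
    return mean(series[n_pre:]) - mean(series[:n_pre])


def detrended_shift(series, n_pre):
    """Project the preintervention trend forward and average the gaps
    between the observed and projected postintervention counts."""
    intercept, slope = fit_line(list(range(n_pre)), series[:n_pre])
    gaps = [series[t] - (intercept + slope * t)
            for t in range(n_pre, len(series))]
    return mean(gaps)


series = make_series()
print(naive_shift(series, 6))      # understates the effect: the decline masks it
print(detrended_shift(series, 6))  # compares against the projected trend
```

Because the underlying count is falling, the raw contrast of means credits the drug with far less than the constructed effect. Removing the trend first restores the comparison, provided the trend would in fact have continued unchanged.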
In addition, however, comparisons can be made across individ-
uals between those who took the new drug and those who did not,
much as in purely cross-sectional designs. And just as in purely
cross-sectional designs, all variables affecting treatment assign-
ment (i.e., taking the new drug) and the outcome (i.e., T-cell count)
would need to be known, measured, and used in any analysis of
the data. In the simplest analysis, the conditional mean for the
individuals who took the drug would be compared with the con-
ditional mean for the individuals who did not take the drug.
However, it is possible to do better than either the time series or
cross-sectional analyses alone. One can effectively combine the two
to address both confounding temporal variables and confounding
cross-sectional variables. In the simplest case, the mean T-cell
count for the "untreated" individuals would include the T-cell
counts for individuals who were
not treated at all and the T-cell counts during the preintervention
period for individuals who ultimately were treated. The mean T-
cell count for the "treated" individuals would include the T-cell
counts for people who took the new drug in the postintervention
period. And both means would be adjusted for (i.e., would be con-
ditional upon) trends (“T”), roughly contemporaneous events,
and assignment variables ("X"). That is, one would adjust for both
temporal and cross-sectional confounders.
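
A rough sketch of how pooling helps follows. It uses a simple difference-in-differences contrast, which is our stand-in for the fuller conditional-mean adjustment described above, on an invented panel of four people observed for six periods. The baselines, the downward drift of 5 per period, and the drug effect of 40 are all assumptions built into the data.

```python
from statistics import mean

START, TREND, EFFECT = 3, -5, 40  # assumed: first treated period, drift, drug effect


def simulate_panel():
    """Hypothetical panel: two people take the drug from period START on."""
    # person -> (baseline count, takes the drug); takers start out sicker
    baselines = {"A": (400, True), "B": (420, True),
                 "C": (500, False), "D": (520, False)}
    rows = []
    for person, (base, takes_drug) in baselines.items():
        for period in range(6):
            on_drug = takes_drug and period >= START
            rows.append({
                "person": person,
                "period": period,
                "takes_drug": takes_drug,
                "count": base + TREND * period + (EFFECT if on_drug else 0),
            })
    return rows


def group_mean(rows, takes_drug, post):
    """Mean count for one group in the pre- or postintervention periods."""
    return mean(r["count"] for r in rows
                if r["takes_drug"] == takes_drug
                and (r["period"] >= START) == post)


def pre_post_contrast(rows):
    """Time series view: takers after versus takers before."""
    return group_mean(rows, True, True) - group_mean(rows, True, False)


def cross_sectional_contrast(rows):
    """Cross-sectional view: takers versus non-takers, post periods only."""
    return group_mean(rows, True, True) - group_mean(rows, False, True)


def pooled_contrast(rows):
    """Change among takers minus change among non-takers."""
    change_takers = pre_post_contrast(rows)
    change_others = group_mean(rows, False, True) - group_mean(rows, False, False)
    return change_takers - change_others


panel = simulate_panel()
print(pre_post_contrast(panel))         # distorted by the shared downward drift
print(cross_sectional_contrast(panel))  # distorted by the takers' lower baselines
print(pooled_contrast(panel))           # both distortions cancel in this setup
```

The pooled contrast succeeds only because the drift is, by construction, the same for takers and non-takers. Where trends differ between groups, or where events coincide with treatment, explicit adjustment for those variables would still be required.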
Because pooled cross-sectional and time series designs typi-
cally involve more data collection than either cross-sectional or
interrupted time series designs alone, one might wonder when they
are worth the effort. In general, they should be undertaken when
resources allow, assuming that true experiments and regression
discontinuity designs are not practical. First, thought of as
many interrupted time series, the prospects for external validity
are better than for a single interrupted time series. In our AIDS
example, it is possible to explore how well the drug works for a
large group of people, some of whom presumably vary in impor-
tant ways. Second, thought of as a set of cross-sectional compari-
sons arrayed over time, the prospects for external validity are
better than for a single-point-in-time cross-sectional compari-
son. Again using our AIDS


Stage 7: 
Was the Program Worth It? 


The Cost-Effectiveness Question 


In the previous chapter, the issue of cost-effectiveness was
addressed. We have little to add here because the issues for new
and ongoing programs are much the same. Perhaps the
major difference is that for ongoing programs, there is often
more information about the long term to take into account. For
example, insofar as staff salaries are tied to seniority, ongoing
programs may become more expensive in constant dollars as
they age. For a new program, there is very little informa-
tion on the kinds of processes that unfold over the long
haul. In a similar fashion, there will be more information about
benefits. The long-term returns to job training, for example, may
be estimated from data rather than hypothesized from theory or
extrapolated from past research. Given the central role of dis-
counting, having data on such long-term costs and benefits can
be very useful.


Notes 


1. ... because physical outcomes are supposed to be more easily
measured than behavioral outcomes.

2. This can happen even if the definition of success is clearly stated in
advance. However, when there is clear and explicit agreement among key actors
about how good is good enough, there will more likely be one "official" interpre-
tation of the data. Moreover, a clear definition of success before the data are
analyzed lends additional credibility to the results. It is harder to argue that
someone “made” the results come out in a particular way. Ideally, central stake-
holders should settle on how success (or failure) will be defined and then let the
research process independently unfold.

3. It is important to keep in mind that the concepts being illustrated are
generally applicable; for example, in the adult population, one can substitute
individuals, households, neighborhoods, and cities for different levels of aggrega-
tion and life cycle stages for the time periods of adult life.

4. We use the term "comparison group" as a general term to be distinguished
from “control group.” Control groups are comparison groups that have been con-
structed by random assignment.

5. The definition of a confounding variable requires that it be related to both
the assignment of treatment and control conditions (which it was in this case)
and the outcome (which it may not have been in this case).
6. Because classes are being randomized, not students, it is necessary to have
a relatively large number of classes randomly allocated to the experimental and
control conditions to be assured that the two sets of classrooms are approxi-
mately equivalent before the treatment is applied.

7. Structural equation models are statistical techniques for representing in
mathematical form one’s theory about how an outcome is produced. The equa-
tion for a straight line taught in elementary algebra (y = mx + b) is an
illustration of a very simple structural equation and would be appropriate if, in
part, the treatment (x) affected the outcome (y) in a linear fashion. The relevant
data would include observations for x (e.g., memberships in treatment or
control groups) and y (e.g., earnings). Statistical procedures would be used to
estimate the values of the two parameters, m and b. The slope of the line is repre-
sented by m, and it is the estimate of program impact. In the absence of random
assignment, structural equation models are far more complicated and controver-
sial. A latent variable is a variable that is not directly measured. Rather, "indica-
tors” of the latent variable are measured. For example, an indicator of the
amount of crime in a neighborhood (a latent variable) may be the number of
reported crimes.

8. Matching is an underutilized technique in evaluation. No doubt
its lack of popularity stems from a belief that matching is only practical for very
large samples (e.g., 25,000 cases). Suppose, for example, that for an
experimental group member who is White, unmarried, unemployed,
with two children under three, one must find a comparison group member who
has exactly the same configuration of characteristics. And suppose that one
must find a match for every member of the experimental group. Most likely, a
very large number of individuals would have to be screened for the neces-
sary matches to be found. However, with new developments such as "nearest
neighbor matching,” matches can be approximate and not literally identical.
Nearly the same level of internal validity is achieved but with far less search-
ing. Hence, far smaller samples can be used (Rosenbaum and Rubin 1985).

9. We are not considering here the possibility of "interaction effects" in
which the treatment has more impact for some units than others. For example,
an arrest may deter first-time wife beaters but not individuals with a long his-
tory of spousal violence. While interaction effects are common and very impor-
tant for policy purposes, they are beyond the scope of this book. They are consid-
ered, however, in a number of standard references (e.g., Rossi and Freeman 1985).

10. One way to prevent such problems is to “blind” subjects to the intervention
they receive. This is common in clinical trials in which the control group
receives a “placebo” that, to the experiment's subjects, is indistinguishable from
the “real” treatment; neither the experimentals nor the controls know what
group they are in. For example, both groups may be asked to drink a liquid
twice a day that looks and tastes the same whether or not the experimental
medicine is included. There are closely related issues in how the outcome is
measured. In particular, it is important to prevent knowledge of the intervention
received from affecting how the outcome is recorded. For example, it is common in
clinical trials to prevent those who will be diagnosing the outcome
(e.g., cancer) from knowing whether patients are members of the experimental
or control group. Those who know which patients are members of the experimental
group may be inadvertently inclined to see improve-
ments when there really are none. When the experimental subjects do not know
whether they are experimentals or controls, the study is called a "single blind"
experiment. When the individuals doing the measuring are also in the dark, the
study is called a “double blind" experiment.
11. Actually, what is done is to remove ("residualize" or "filter") the preexist-
ing trends so that comparisons between the preintervention and postinterven-
tion periods can be fairly made.

12. The preintervention trends may be captured by observable patterns in the
preintervention time series (e.g., a linear increase over time) or by variables
explaining the trend. For example, increases in water consumption may be a
function of population increases. Insofar as population has been measured over
the preintervention period, it can be used to model trends in consumption.

13. One can construct a randomized experiment within an interrupted time
series framework by randomly assigning when the treatment is introduced (Edg-
ington 1987, chap. 10). These are a special type of “single subject design."

14. As this is being written, a randomized experiment is actually the design
anticipated. Among a set of judges who volunteer for the workshops, a random
half will have their workshops postponed six months. During those six months,
the first set of judges is a legitimate experimental group and the second set of
judges is a legitimate control group. And it is the sentencing patterns for the two
groups over those six months from which treatment effects will be estimated.

15. The good news is that only variables related to the assignment and the
outcome need be included. Variables affecting only the assignment or only the
outcome need not be included to obtain unbiased estimates.

16. Sometimes pretest measures are necessary control variables. The sen-
tence given to a convicted felon may depend on the length of an earlier sentence;
the earlier sentence is a pretest measure for the sentencing outcome. And
sometimes pretest measures can be used to control for unobserved differences
between units (sometimes called “unobserved heterogeneity”).


Some Final Observations 


The field of evaluation research is scarcely out of its infancy as
a social scientific activity. The first large-scale field experi-
ments were initiated in the mid-1960s. Concern for large-scale
national evaluations of social programs had its origins in the
War on Poverty. The art of designing large-scale implementation
and monitoring studies is still evolving rapidly. Concern with
the scientific validity of qualitative research has just begun. As
part of all this, the demand for sound program evaluations con-
tinues to grow.

In this context, perhaps the best overall message is to keep 
evaluations as simple as possible. Simple programs will typi- 
cally be hard enough to design and field. Simple research 
designs usually will be sufficiently demanding. And simple data 
analyses will likely tax the best evaluators available. Put 
another way, there is no such thing as a routine evaluation. 
Adding unnecessary complexity to the burden is to turn a
promising opportunity into almost certain disaster.


Simplicity, however, is not enough. It is also important to
think defensively. For
example, it is typically useful to get, in writing, all significant
understandings between the evaluator and program administra- 
tors (e.g., the definition of a successful outcome). Even under 
the best of circumstances and with the best of intentions, 
organizational memories can be very short. Likewise, it is essen- 
tial that quality control procedures be introduced for all facets 
of data collection: sampling, measurement, data entry, and the 
like. Indeed, it is often prudent to allocate as much as 20 percent 
of one’s evaluation research budget to data quality control. And, 
before diving into a fancy statistical analysis, it is essential to 
carefully inspect the data for errors of various sorts that will 
almost certainly be present. This means not just a search for iso- 
lated “outliers,” but internal consistency checks for anomalous 
relationships among key variables. 

Finally, there is no recipe. Prescriptions for “successful” 
evaluations are, in practice, prescriptions for failure. The tech- 
niques that evaluators may bring to bear are only tools, and even 
the very best of tools do not ensure a worthy product. Just as for 
any craft, there is no substitute for intelligence, experience, per-
severance, and a touch of whimsy.


Appendix 


Guide to Literature on, Professional 
Associations of, and Organizations 
Engaged in Evaluation Research 
and Social Policy Research in the 
United States 


"The complete policy researcher" must be a knowledgeable 
methodologist, a creative theoretician, a capable manager, 
and a skilled politician. This is not the job description for 
a graduate student who is never quite able to understand 
what internal validity is all about, whose conception of the- 
ory is limited to a terminological maze that makes no 
claims about how things "work" and whose administrative 
skills are taxed by managing to get a dissertation typed and
turned in on time. Herbert L. Costner, "Commentaries" in 
Demerath, Larsen, and Schuessler, eds., Social Policy and 
Sociology, (New York: Academic Press, 1975, p. 262) 


I. Some General References 


The books and journals devoted largely to evaluation 
research methods and to evaluation studies have increased 
considerably in the past decade. Listed below are some of 
the major general references of which you should be aware 
if you want to become knowledgeable about evaluation 
research theory and practice. The commentaries after each
reference may be used as a guide to content. 


A. Evaluation Journals 


Evaluation Review: A Journal of Applied Social Research. (For-
merly Evaluation Quarterly.) Regarded as the best of the profes-
sional journals. Biased toward quantitative and formal
approaches. Published bimonthly by Sage Publications. Interdis-
ciplinary, often technical, and always of high quality.


Evaluation News. An official publication of the American Evalu- 
ation Association, formerly published by the Evaluation Network 
(see organizations, below). Published quarterly by Sage. Contains 
mainly short articles primarily addressed to professional issues 
and to substantive evaluation problems. Contains a useful set of 
short reviews of new publications in evaluation. Tends to favor 
more qualitative evaluation styles. 


New Directions for Program Evaluation. Quarterly journal of the 
American Evaluation Association (formerly the Evaluation 
Research Society) published by Jossey-Bass. Mainly special issues, 
some based on annual meetings of the society. 


Evaluation and Program Planning. Independent quarterly
specializing in evaluations of human services programs, espe-
cially mental health programs. Now officially the journal of the
Eastern Evaluation Research Society, a regional affiliate of the
American Evaluation Association.


Evaluation Studies Review Annual. Annual collection of the
“best” articles and unpublished pieces on evaluation methods and
findings. Published by Sage and edited by editors separately
picked for each annual. Quality variable but some issues are
extremely good.


Policy Analysis. Quarterly published by the University of Califor-
nia Press and edited by Berkeley's Public Policy School. Largely
devoted to policy analysis although there are many articles on
evaluations.


Journal of Policy Analysis and Management. Published quarterly
by John Wiley and edited at Harvard's Kennedy School, this is
probably the best policy analysis journal going. Contains good
reviews of recent literature.
B. Sometime Evaluation Journals


These are journals in which evaluations and related policy 
research issues often appear, but not consistently. 


Human Organization. Journal of the Society for Applied Anthro- 
pology. 

Social Problems. Journal of the Society for the Scientific Study of 
Social Problems. 


Journal of Social Issues. Journal of the Society for the Psychologi- 
cal Study of Social Problems (an affiliate of the APA). 


Journal of Applied Psychology. Although heavy on industrial psy- 
chology, occasional articles on evaluation appear. 


Journal of Human Resources. Devoted largely to issues in labor
economics and training.


Medical Anthropology. Devoted to cultural anthropology studies 
of medical problems and medical care. 


Health and Human Behavior. Published by the American iPad 
logical Association, occasionally containing evaluation studies. 


Social Science Research. Contains many articles on evaluation 
issues and studies. 

American Journal of Public Health. Journal of the American Pub-
lic Health Association; routinely contains evaluations of
services organizations.


In addition, from time to time, the mainline professional journals
will publish articles on evaluation, especially on epistemological
and technical issues.


C. Major General References On Evaluation 
Note: Especially important references are 
marked with * *. 


Bennett, A. and A. Lumsdaine. 1975. Evaluations and Experi-
ments. Academic Press.
Excellent (although a little old) compilation of papers on field experi- 
ments evaluating innovative programs. 
Campbell, D. T. and J. Stanley. 1967. Experimental and Quasi-
Experimental Designs for Research. Rand McNally. 
A classic that has dominated the evaluation research design literature 
since publication. Concerned primarily with educational evaluations 
but very general. 


**Cronbach, L. J. 1982. Designing Evaluations of Educational 
and Social Programs. Jossey-Bass. 


An excellent text advancing a counter-Campbellian perspective that 
makes a great deal of sense. 


**Cook, T. and D. T. Campbell. 1979. Quasi-Experimentation. 
Rand McNally. 


Excellent exposition of research designs used commonly in evalua-
tions by two of the best practitioners of the art of exposition. Some-
what removed from practice, however.


Cronbach, L. J. (and associates). 1980. Toward Reform of Program 
Evaluation. Jossey-Bass. 
A founding father of the field and a large cast of associates at Stanford
ruminate over the faults of program evaluation and suggest reforms in
the form of 95 “theses.” Sensible suggestions although ponderously
written.


Franklin, J. L. and J. H. Thrasher. 1976. An Introduction to Pro- 
gram Evaluation. John Wiley. 


An elementary introduction to program evaluation in the public
health and health delivery fields.


Guba, E. and Y. Lincoln. 1982. Effective Evaluation. Jossey-Bass.


Advocates of naturalistic, responsive evaluations. Perhaps the best 
case made for qualitative approaches to evaluation. 


Guttentag, M. and E. Struening, eds. 1975. Handbook of Evalua-
tion Research. 2 vols. Sage.
Although very much out of date, these two volumes are quite compre-
hensive in their coverage of major issues and substantive applications.
Most of the chapter contributions (by quite well-known authors) were
written in the late 1960s, just as the field began to flower.


House, E. R. 1982. Evaluating with Validity. Sage.
Long essay on evaluations that can be used to improve programs, espe-
cially educational ones. Takes an anti-social-science viewpoint.


Judd, C. M. and D. S. Kenny. 1981. Estimating the Effects of Social
Interventions. Cambridge University Press.
A useful, if somewhat dated, survey of approaches to the quantitative
assessment and estimation of net impacts of social programs.


Morris, L. L., ed. 1988. Program Evaluation Kit. 2nd ed., 8 vols. 
Sage. 
A set of cookbooks written to help local agencies carry out evaluation 
studies; mainly oriented toward local educational agencies. Simply 
written and quite good, but not very sophisticated technically. Do not 
use on evaluations that count. 


Patton, M. 1980. Qualitative Evaluation Methods. Sage. 
Strong advocacy of qualitative approaches, but not the best reasoned. 


Riecken, H. and R. Boruch, and associates. 1974. Social Experi- 
mentation. Academic Press. 
Outcome of an SSRC committee on social experimentation. Excellent 
and simply written review of the major issues (as understood in the 
early 1970s) in the design and conduct of large-scale social exper- 
iments. 

**Rossi, P. H. and H. Freeman. 1985. Evaluation: A Systematic 
Approach. 4th ed. Sage. 
An excellent introduction to evaluation research, sprinkled through- 
out with many examples. Probably the best text around for beginners; 
well written. 

Rossi, P. H. and W. Williams. 1974. Evaluating Social Programs. 
Academic Press. 
Outgrowth of a 1969 conference on evaluation. Excellent papers but 
out of date. Should be read out of antiquarian interest. 

Scriven, Michael. 1980. Evaluation Thesaurus. 3rd ed. Scarecrow. 
(private press owned by Scriven).
Scriven's views on evaluation given in the guise of a dictionary of 
evaluation terms. Written with grace and skill. 

Scriven, M. 1980. The Logic of Evaluation. Scarecrow. 
An eccentric but very literate and amusing review of evaluation as an 
enterprise that must bend to fit the needs of the program being 
evaluated. 

Suchman, E. A. 1967. Evaluation Research. Russell Sage. 
Old but very good. The best of the early comprehensive reviews of the 
field. Mainly addressed to the public health field. 

Wholey, J. S. 1979. Evaluation: Promise and Performance. Urban 
Institute. 


Heavy emphases on program monitoring and evaluability assess- 
ments, especially of established programs; can also be regarded as a do- 
it-yourself advocate. 
NOTE: A few publishing houses—Sage, Jossey-Bass, and Academic Press— 
dominate the publication of evaluation-oriented books and texts. See their cata- 
logues for long lists of general references in the field. 
D. Some Major Technical Reference Journals


Journal of the American Statistical Association (JASA). An 
excellent diverse journal that contains an "applications" section 
that is especially of interest to evaluators and applied social scien- 
tists. Articles are often quite difficult. 


Journal of the Royal Statistical Society. The British counterpart 


of JASA and very similar in content and style, but in three sepa- 
rate series. 


Econometrica. Journal of the Econometrics Society; publishes 
articles on statistical issues, substantive problems, and economic 


theory. Useful (and often difficult) for the model-building side of
evaluation research. 


Journal of Econometrics. Much like Econometrica. 


Psychometrika. Journal of the Psychometric Society; focuses on 


measurement issues and the analysis of observational data. Often 
quite difficult. 


Biometrika. Journal of the Biometric Society, publishing articles 
on statistical issues in the biological and health sciences. Often
contains interesting articles on true experiments and approxima- 
tions of them, but often quite difficult. 


Biometrics. Much like Biometrika. 


Technometrics. Journal of the American Society for Quality Con- 
trol; publishes articles on statistical applications in the physical, 
chemical, and engineering fields. Has a lot of good materials for 
social scientists, especially on research design. Often very difficult. 


Journal of Educational Statistics. Publishes articles on statistical 


applications in educational research. Usually more accessible 
than the blank-metrics journals. 


Sociological Methodology. At one time published by the Ameri-
can Sociological Association, an annual volume of solicited and
contributed pieces on methodological issues in sociological
research. Uneven in quality and relevance to evaluation. Didactic
and review articles are often quite good.


Sociological Methods and Research. Quarterly devoted to socio-
logical research issues. Currently struggling to fill each issue and,
as a result, quality is often marginal. There are, however, always
some articles relevant to evaluation research.
E. Some General Technical References


These are books that contain expositions of the statistical tech- 
niques and research designs used in evaluation research. Usually 
these also contain discussions of other procedures as well. 


Achen, Christopher. 1986. The Statistical Analysis of Quasi- 
Experiments. University of California. 

An excellent discussion of what are the appropriate statistical meth- 
ods to apply to quasi-experiments. Read only in an optimistic mood. 

Belsley, David A. et al. 1980. Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity. John Wiley.

An excellent discussion of some important things that can go wrong
in multivariate analyses and some of the things you can do about it
(sometimes).

Box, George E. P. and Gwilym M. Jenkins. 1976. Time Series Anal-
ysis: Forecasting and Control. Rev. ed. Holden-Day.

The granddaddy reference on modeling time series and still good. 

Box, George E. P. et al. 1978. Statistics for Experimenters. John 
Wiley. 

An integrated discussion of randomized experiments and analysis of 
variance in a very accessible form. 

Chiang, A. C. 1974. Fundamental Methods of Mathematical Eco-
nomics. McGraw-Hill.
An accessible and excellent reference for applied mathematics (e.g.,
calculus, matrix algebra) in the social sciences.

Cochran, William G. 1977. Sampling Techniques. 3rd ed. John
Wiley.
The classic text revised and still excellent, containing sampling theory.

Cochran, William G. 1983. Planning and Evaluation of Observa- 
tional Studies. John Wiley. 

Excellent discussion of nonexperimental research procedures. 

Cox, D. R. 1958. Planning Experiments. John Wiley. 

Still an excellent treatment of randomized experiments. 

Dillman, Don A. 1978. Mail and Telephone Surveys: The Total 
Design Method. John Wiley. 

A real cookbook for the conduct of mail and telephone surveys that get 
very high response rates. Especially good on mail surveys. 

Efron, Bradley. 1982. The Jackknife, the Bootstrap and Other 
Resampling Plans. Society for Industrial and Applied Mathe- 
matics. 

Still the best overall treatment of resampling procedures, but pithy and 
a bit dated. 


Freedman, David A. et al. 1978. Statistics. Norton.


Many argue that this is the best introduction to applied statistics 
around. Even seasoned researchers find it instructive. Very accessible. 
New edition due in 1990. 


Glass, G. V. et al. 1981. Meta-Analysis in Social Research. Sage. 
Exposition of methods for aggregating and assessing the results of 
many evaluations. 

Granger, C.W.J. 1980. Forecasting in Business and Economics. 
Academic Press. 

A wonderful and accessible introduction to forecasting. 

Granger, C.W.J. and P. Newbold. 1986. Forecasting Economic 
Time Series. 2nd ed. Academic Press. 

An excellent but relatively advanced treatment of forecasting with 
good material on intervention analysis. New edition due in 1990. 

Groves, Robert M. and R. L. Kahn. 1980. Surveys by Telephone. 
Academic Press. 

Excellent discussion of random-digit-dialing methods of telephone 
surveys and their advantages, by two SRC survey experts. 

Hanushek, E. A. and J. E. Jackson. 1978. Statistical Methods for 
Social Scientists. Academic Press. 

A very good intermediate econometrics text with lots of examples 
from sociology and political science. Written for noneconomists. 


Harvey, A. C. 1981. The Econometric Analysis of Time Series. 
Halsted. 


Probably the best text on the analysis of time series from an econometric 
point of view. 

Hoaglin, David C. et al. 1982. Data for Decisions. Abt Books. 


A good introduction to how data should be used to make policy deci- 
sions. Very accessible. 


Hsiao, C. 1986. Analysis of Panel Data. Cambridge University 
Press. 


Probably the most recent and thorough treatment of the analysis of 
panel designs, but not easy going. 


Judge, George G. et al. 1985. The Theory and Practice of Econometrics. 
John Wiley. 

Perhaps the most wide ranging of the current econometrics texts. 
Many topics covered and covered well. 

Kazdin, A. E. 1982. Single Case Research Designs: Methods for 
Clinical and Applied Settings. Oxford University Press. 

Innovative attempt to quantify the study of single cases, mainly in 
clinical settings. 

Kish, Leslie. 1965. Survey Sampling. John Wiley. 

An old, somewhat out-of-date, but excellent advanced text on the sam- 
pling of human populations. 

Kish, Leslie. 1987. Statistical Designs for Research. John Wiley. 
Excellent discussions (although uneven) of experimental and quasi- 
experimental designs. 

Lawless, Jerald F. 1982. Statistical Models for Lifetime Data. John 
Wiley. 

Everything you wanted to know about "life history" data and how to 
handle them, from the biomedical tradition in which "survival analysis" 
was invented. 

Light, Richard J. and D. B. Pillemer. 1984. Summing Up: The Sci- 
ence of Reviewing Research. Harvard University Press. 

An excellent exposition of how to avoid biases in summarizing the 
results of many studies. 

Little, R.J.A. and D. B. Rubin. 1987. Statistical Analysis with 
Missing Data. John Wiley. 

An excellent and current treatment of missing data problems, but 
fairly difficult and sometimes promising more than it delivers. Except 
for relatively simple cases, there is no real fix for missing data. 

Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables 
in Econometrics. Cambridge University Press. 

An excellent but demanding treatment of how to handle "unfriendly" 
dependent variables. Getting slightly dated, but still very useful. 

McCullagh, P. and J. A. Nelder. 1983. Generalized Linear Models. 
Chapman and Hall. 

An excellent treatment of an overarching framework in which most 
statistical procedures used in evaluation research can be placed. 
Somewhat demanding but worth the effort. 

Miles, Matthew and A. Michael Huberman. 1984. Qualitative 
Data Analysis. Sage. 

An interesting discussion of how to treat qualitative data from 
"fieldwork" in a rigorous way. Examples used are largely qualitative 
evaluations. 

Mishan, E. J. 1976. Cost-Benefit Analysis. 2nd ed. Praeger. 

The full treatment; very difficult without some background in economics. 

Mosteller, Frederick et al. 1983. Beginning Statistics with Data 
Analysis. Addison-Wesley. 

An excellent introductory statistics book with large sections devoted 
to newer descriptive techniques. Also favors a "robust" (i.e., cautious) 
view of statistics. 

Morrison, D. 1976. Multivariate Statistical Methods. 2nd ed. 
McGraw-Hill. 
An excellent intermediate text on multivariate statistical methods 
popular in education and psychology. 

Pindyck, R. S. and D. L. Rubinfeld. 1981. Econometric Models and 
Economic Forecasts. 2nd ed. McGraw-Hill. 

Perhaps the best intermediate text in econometrics with excellent 
treatments of simulations and univariate Box-Jenkins procedures. 

Pollard, W. E. 1986. Bayesian Statistics for Evaluation Research: 
An Introduction. Sage. 


If Bayesian statistics is the wave of the future, this is a good first board 
to ride. 


Rossi, P. H., J. D. Wright, and A. B. Anderson, eds. 1983. Hand- 
book of Survey Research. Academic Press. 


A good collection of fairly technical papers on sampling, survey 
questionnaire writing, measurement, and analysis problems in sample 
surveys; not for the beginner, however. 


Rousseeuw, P. J. and A. M. Leroy. 1987. Robust Regression & Outlier 
Detection. John Wiley. 

An excellent and accessible treatment of robust regression, including 
the very newest techniques. Can be purchased with user-friendly software 
that does many of the techniques described. 


Sudman, Seymour. 1976. Applied Sampling. Academic Press. 

An excellent introduction to population sampling from a practical 
perspective. (Not for persons looking for sampling theory.) 

Sudman, Seymour and Norman M. Bradburn. 1982. Asking Questions: 
A Practical Guide to Questionnaire Design. Jossey-Bass. 

Just what the title says it is. The best cookbook yet with plenty of 
examples. 


Thompson, M. S. 1980. Benefit Cost Analysis for Program Evalu- 
ation. Sage. 


A very accessible introduction to cost-benefit analysis used in evaluation 
of programs. 


ORGANIZATION OF THE DISCIPLINE 


A. Professional Societies (and subsocieties) 


American Evaluation Association. An amalgam of the Evaluation 
Research Society and the Evaluation Network formed in 1985. See 
entries below for further information on its constituent parts. Everything 
below that applies to ERS or EN now applies to AEA. 

Evaluation Research Society. Annual meeting in October or November. 
Publishes New Directions in Evaluation (quarterly, see above) 
and sponsors monographs. Composed primarily of psychologists 
and sociologists and heavily oriented to human service social programs. 
Annual meetings are interesting, serious, and small enough 
to enjoy. Membership about 2,000-3,000. 

Evaluation Network. Smaller association of evaluators primarily 
interested in qualitative evaluations of small-scale programs in edu- 
cation and human services. 

In addition, sections of the American Psychological Association, 
American Sociological Association, the American Economic 
Association, and the American Educational Research Association 
all have sessions at their annual meetings devoted to problems of 
evaluation. 


B. Major Evaluation Research Producers 


Evaluation research is now an industry with university departments, 
university research organizations, private firms with 
research branches, and private firms devoted mainly to evaluation, 
all producing evaluation research. In addition, some evaluation 
(perhaps a large proportion of all evaluations) is done within 
agencies with responsibilities for operating social programs. 

However, as in other industries, there is considerable concentration. 
Although perhaps as many as 1,000-2,000 entities do 
evaluation research, as much as 50% of all the funds are obtained 
by the top 10-15 largest private firms, who do most of the large-scale 
(and expensive) evaluations. Some of the largest firms have 
more social science PhDs on their payrolls than most social science 
divisions within universities. For example, at its peak in the 
1970s, Abt Associates had a staff of more than 100 PhDs and a 
support staff of about 400 research assistants and clerks. 

Some of the largest firms are listed below: 


Abt Associates, Inc., Cambridge, MA 
The Rand Corporation, Santa Monica, CA (not for profit) 
Educational Testing Service, Princeton, NJ (not for profit) 
Mathematica, Inc., Princeton, NJ 
Battelle Memorial Institute, Columbus, OH (not for profit) 
The Urban Institute, Washington DC (not for profit) 
The Mitre Corporation, McLean, VA 
Westat, Inc., Silver Springs, MD 
Research Triangle Institute, Raleigh-Durham, NC (not for profit) 
National Analysts, Philadelphia, PA 
American Institutes for Research, Pittsburgh, PA 


A few of the major university-affiliated research organizations 
that are also in the "big" league follow: 

Institute for Research on Poverty, University of Wisconsin 

NORC (National Opinion Research Center), University of Chicago 

Institute for Social Research, University of Michigan 

Survey Research Center, Temple University 


In addition, most of the major graduate centers in the social sci- 
ences have one or more research centers in the social sciences 
that participate in evaluation research activities. 


MAJOR SOURCES OF EVALUATION FUNDING ON 
THE NATIONAL LEVEL 


Evaluations are typically funded by sponsors who have oversight 
responsibilities for the programs in question. On the national 
level this ordinarily means that federal departments and agencies 
are the sources of funds. Often Congress incorporates mandated 
evaluations into authorizing legislation, sometimes directing an 
agency to undertake an evaluation of a specific sort. National 
evaluations are typically funded by contract let to one of the 
major producers listed above. 

The major federal agencies that frequently fund evaluations 
are as follows: 

Department of Education. Although its research budget was decimated 
during the Reagan administration, this department still 
conducts some of the major national evaluations. Currently it is 
planning for a national impact assessment of its vocational 
rehabilitation program. Tends to favor educational researchers as 
evaluators. 

Department of Labor. Strong funder of evaluations concerned 
with its major manpower training, employment security (unemployment 
insurance, job placement), and so on. Tends to favor 
economists. 

Department of Agriculture. Funds evaluations in the fields of 
nutrition, food stamps, and school lunch programs. 

Department of Health and Human Services. This extremely 
large agency funds evaluations mainly through its component 
divisions, among which the more prominent are the following: 
Public Health Service (includes National Institutes of Health, 
Centers for Disease Control), Social Security Administration, 
Health Care Financing Administration, Food and Drug Administration, 
and so on. 

Department of Housing and Urban Development. Although also 
in eclipse during the Reagan regime, HUD has financed major 
social experiments and evaluations of many of its major 
programs. 

Environmental Protection Agency. Currently concerned with the 
evaluation of its mass educational programs designed to raise 
public consciousness concerning hazardous substances. 

Department of Defense. Runs major evaluations of human 
resources programs. Currently being forced by Congress into 
evaluations of its weapons systems. 

General Accounting Office. Although it does not contract out its 
work, this agency has established an evaluation unit that undertakes 
evaluations at the request of Congress. The Division of Program 
Evaluation and Methodology now has about 50 PhD-level 
social scientists. 

National Institute of Justice. A unit of the Department of Justice 
that has funded several field experiments on prospective criminal 
justice policies. 


IV. SOME MAJOR EVALUATION RESEARCH AND 
EXEMPLARY PUBLISHED MONOGRAPHS IN 
EVALUATION 


Each of the books cited above as general reference books contains 
an extensive bibliography of evaluation studies. Many of the references 
are to so-called fugitive documents (i.e., those not distributed 
by major publishers or published in easily accessible 
scholarly journals) and hence are difficult to locate in conventional 
university libraries. Some of the better ones that have been 
published in accessible form are listed below. 

If you become stricken by a passion for evaluation, we suggest 
that you begin early to build your own library of fugitive documents. 
Many such documents, especially relating to studies that 
have been financed by federal agencies, are available in microfilm 
or xeroxed form through NTIS (National Technical Information Service) 
or ERIC (a computerized reference service supported by the 
Department of Education). Evaluation News (see journals above) 
contains a section on ongoing evaluation projects and recently 
issued reports. 

A good university social science reference librarian can be of 
immense help in locating studies and arranging for access. 

Although most evaluation research or comments on evaluations 
are never published by commercial or university presses, 
some of the best ones and some of those that are concerned with 
major evaluation studies do get published. Listed below are some 
of the ones that we think are excellent, concern major programs, 
or both. 


Berk, R. A. et al. 1981. Water Shortage: Lessons in Conservation 
from the Great California Drought, 1976-1977. Abt Books. 

An analysis of the impact of water conservation programs in California. 

Bradbury, K. L. and A. Downs, eds. 1981. Do Housing Allowances 
Work? Brookings Institution. 

Collection of essays evaluating the Housing Allowance Experiments 
conducted by Abt Associates and the Rand Corporation. 

Bunker, J. P., B. A. Barnes, and F. Mosteller. 1977. Costs, Risks and 
Benefits of Surgery. Oxford University Press. 

A review of research on the relative effectiveness of surgical versus 
noninvasive procedures where there is a choice. 


Cicirelli, V. G. et al. 1969. The Impact of Head Start. Athens, OH: 
Westinghouse Learning Corporation and Ohio University. 

A very controversial first evaluation of one of the more prominent 
social programs for preschool children. 

Coleman, J. S. et al. 1966. Equality of Educational Opportunity. 
Washington, DC: Government Printing Office. 

Needs assessment research that radically changed the direction of 
educational research. 

Coleman, J. S. et al. 1982. High School Achievement: Public, 
Catholic and Private Schools Compared. Basic Books. 

A controversial attempt to assess the differential effectiveness of high 
schools, purportedly finding that Catholic high school students 
achieve higher levels of math and verbal competence. 

Cook, T. 1975. Sesame Street Revisited. Russell Sage. 

Classic critique of evaluation of the children's educational TV program. 

Cutright, P. and F. S. Jaffe. 1979. Impact of Family Planning Programs 
on Fertility: The U.S. Experience. Praeger. 

A brilliant use of demographic data and survey data to estimate the 
impact of family planning clinics in the United States. 

Davidson et al. 1981. Evaluation Strategies in Criminal Justice. 
Pergamon. 

An account of the failure of an evaluation of juvenile justice programs 
in Michigan and a frank account of the sources of that failure. 

Fairweather, G. W. and L. G. Tornatzky. 1977. Experimental 
Methods for Social Policy Research. Pergamon. 

One of the best examples of the use of sophisticated evaluation 
research to design and refine a program for the successful reintegration 
into noninstitutional life of persons discharged from mental hospitals. 

Friedman, D. and D. H. Weinberg. 1982. The Economics of Hous- 
ing Vouchers. Academic Press. 

Collection of papers on the Housing Allowance Experiments. 

Gleser, G. C. et al. 1981. Prolonged Psychosocial Effects of Dis- 
aster: A Study of Buffalo Creek. Academic Press. 

An attempt by a group of social scientists to estimate the residual psy- 
chological effects of the Buffalo Creek disaster in which a dam burst 
and wiped out a small West Virginia community. Extremely skillful. 

Graham, John D., ed. 1988. Preventing Automobile Injury: New 
Findings from Evaluation Research. Auburn House. 

Series of reports on impact of seat belt, drinking, and speed limit pro- 
grams on automobile accident rates. 

Gramlich, E. M. and P. P. Koshel. 1975. Educational Performance 
Contracting: An Evaluation of an Experiment. Brookings 
Institution. 

A reanalysis of a pilot test of a program to contract out the teaching of 
certain subjects in high schools. 

Kassebaum, G. et al. 1971. Prison Treatment and Parole Survival. 
John Wiley. 

Classic controlled experiment evaluating the effectiveness of a group 
therapy program in California prisons. 

Kelling, G. T. et al. 1974. The Kansas City Patrol Experiment. The 
Police Foundation. 

Controlled field experiment on police patrolling strategies. 

Kershaw, D. and J. Fair. 1976. The New Jersey-Pennsylvania 
Income Maintenance Experiment. Academic Press. 

Narrative account of the first large-scale income maintenance field 
controlled experiment. 

McLaughlin, M. 1975. Evaluation and Reform: The Elementary 
and Secondary Education Act of 1965. Ballinger. 

An account of the failure of attempts to evaluate 
the impact of this federal legislation on the education of disadvantaged 
children. 

Mielke, K. W. and J. W. Swinehart. 1977. Evaluation of the Feeling 
Good Television Series. Children's Television Workshop. 
A famous evaluation of an educational television program that led to 
the program being canceled. 


Milavsky, J. R., R. C. Kessler, H. H. Stipp, and W. S. Rubin. 1982. 
Television and Aggression: A Panel Study. Academic Press. 
An extremely skillful attempt to estimate the effects of watching vio- 
lence on TV on the aggressive behavior of young schoolchildren. 

Nathan, R. P. et al. 1983. The Consequences of Cuts. Princeton 
Urban and Regional Research Center. 

A qualitative attempt to assess the Reagan regime's effects on local 
urban programs. 


Peirce, W. S. 1981. Bureaucratic Failure and Public Expenditure. 
Academic Press. 

A review of the effectiveness of public programs of all sorts and the 
development of a theory for explaining why they fail. 

Pressman, J. L. and A. B. Wildavsky. 1973. Implementation. 
University of California Press. 

A description of how an important program was implemented 
improperly. 

Raizen, S. A. and P. H. Rossi, eds. 1981. Program Evaluation in 
Education: When? How? To What Ends? National Academy 
Press. 

The report of a National Academy of Sciences committee that 
reviewed the evaluation program of the Department of Education. 

Robins, P. K. et al., eds. 1980. A Guaranteed Annual Income: Evidence 
from a Social Experiment. Academic Press. 

Reports on the income maintenance experiments conducted in Seattle 
and Denver. Probably the best of the randomized field experiments of 
the 1970s. 

Rossi, P. H., R. A. Berk, and K. Lenihan. 1980. Money, Work and 
Crime: Experimental Evidence. Academic Press. 

Report of a large-scale randomized field experiment with persons 
released from the prisons of Texas and Georgia, the treatment being 
eligibility for unemployment compensation payments. 

Rossi, P. H. and K. Lyall. 1975. Reforming Social Welfare. Russell 
Sage. 

An assessment of the New Jersey-Pennsylvania Income Maintenance 
Experiment. 

Rossi, P. H., J. D. Wright, E. Weber-Burdin, and J. Pereira. 1983. Victims 
of the Environment: Losses from Natural Hazards in the 
United States: 1970-1980. Plenum. 

A "needs assessment" of the losses suffered by households in the 
United States over a decade, outlining the problems that are not 
affected by U.S. natural hazards policies of relief. 

Smith, M. L., G. V. Glass, and T. I. Miller. 1980. The Benefits of 
Psychotherapy: An Evaluation. Johns Hopkins University 
Press. 

A meta-evaluation that summarizes and puts together several hundred 
evaluations of the effectiveness of psychotherapy. 

Struyk, R. J. and M. Bendick, eds. 1981. Housing Vouchers for the 
Poor. Urban Institute. 

Another set of articles summarizing the findings of the Housing 
Allowance Experiments run by Abt and Rand. 

Vanecko, J. J. and B. Jacobs. 1970. Reports from the 100-City CAP 
Evaluation: The Impact of the Community Action Program 
on Institutional Change. National Opinion Research Center. 

A description of the local community action programs financed by the 
federal government in the 1960s. 

Williams, W. 1980. Government by Agency: Lessons from the 
Social Program Grants in Aid Experience. Academic Press. 
A qualitative assessment of the impact of block grants on local 
programs. 

Wilner, D. M., R. P. Walkley, T. C. Pinkerton, and M. Tayback. 
1962. The Housing Environment and Family Life. Johns Hopkins 
University Press. 

A classic evaluation of the effects of public housing on households. 

Wright, J. D. et al. 1979. After the Clean-Up: The Long Range 
Effects of Natural Disasters. Sage. 

An evaluation of the long-range effects of natural hazard events (floods, 
hurricanes, and tornadoes) on growth trends in local communities. 